Lessons from a year as a data scientist

Twelve months ago, I made a career move from data analytics into the world of data science. In this post, I’ll cover the key learnings from my first year in the job, which has included building and deploying my first machine learning model as a junior data scientist.

Photo by Carlos Muza on Unsplash

Business Knowledge and Context

The first learning is something that I’ve read on countless blogs by data scientists and will repeat here due to its importance. Having the business context of the problem you are trying to solve is incredibly important because it determines the fundamental logic that you build into and around your model. More specifically, it can determine how you transform your data for training and scoring, which features you create (utilise your stakeholders’ vast business knowledge for this!), and how you define your target variable. If these do not align with your business rules, your model will not be relevant and you won’t be giving your stakeholders the best possible results. These considerations may also lead you to conclude that the business problem cannot be solved with machine learning at all. For example, maybe the data you require is loaded overnight but your stakeholder requires real-time scoring. This would be an immediate blocker to your project.

In addition, a topic that should be discussed at the outset of the project is what your stakeholders expect from your model and how they plan to use it. They may rely on your technical expertise for guidance on what is possible (and you, in turn, may rely on a data engineer to gauge what is possible). For example, if you are building a classification model, is your stakeholder expecting to receive the raw probability score from the model, or would they prefer a classification based on a threshold of your choosing? Will the model output be used by downstream systems and if so, how? The answers to these questions may affect which features you consider for your model, and what level of performance the model needs to reach to be feasible or worthwhile for your stakeholder. For example, imagine that the primary aim of the project is to capture every positive case in a particular scenario; you might tune your model for high recall and then discuss with your stakeholders the maximum recall you think the model could achieve, in order to understand whether that would be acceptable for the use case.
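To make the threshold and recall discussion concrete, here is a minimal sketch of how you might pick a classification threshold that meets a recall target and see what precision you pay for it. The function name, the toy labels and the scores are all made up for illustration; in practice you would run this on a held-out validation set.

```python
def threshold_for_recall(y_true, scores, target_recall):
    """Return (threshold, recall, precision) for the highest threshold
    whose recall meets the target.

    A case is predicted positive when its score is >= the threshold.
    """
    total_pos = sum(y_true)
    if total_pos == 0:
        raise ValueError("need at least one positive case")
    if not 0 < target_recall <= 1:
        raise ValueError("target_recall must be in (0, 1]")

    # Try candidate thresholds from strictest to most lenient.
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for s, y in zip(scores, y_true) if s >= threshold and y == 1)
        fp = sum(1 for s, y in zip(scores, y_true) if s >= threshold and y == 0)
        recall = tp / total_pos
        if recall >= target_recall:
            precision = tp / (tp + fp)
            return threshold, recall, precision

    # Unreachable: the lowest threshold always gives a recall of 1.0.
    raise RuntimeError("no threshold found")


# Illustrative data only: 1 = positive case, scores are model probabilities.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.35, 0.3, 0.1]

threshold, recall, precision = threshold_for_recall(y_true, scores, 0.75)
print(f"threshold={threshold}, recall={recall:.2f}, precision={precision:.2f}")
```

Showing stakeholders the recall and precision side by side at a few candidate thresholds can make the trade-off much easier to discuss than a single headline metric.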

The skill of being able to understand the business need and the wider picture of how a model will be used by stakeholders is one that is highly valuable to employers, and is certainly something that should be highlighted on any data scientist’s CV.

Communication

The second lesson is the value of regular, honest communication between yourself and other members of your team or the wider business.

I have been lucky enough to benefit from mentoring, which I have found invaluable. Through biweekly catch-ups with a senior data scientist, I have been able to receive feedback (good and bad), share learnings and discuss new areas to explore. These sessions have helped me identify areas for improvement, both on individual projects and in my personal development, and have given me a safe space to ask questions.

I also work closely with a data engineer, particularly on deploying models into production and maintaining models once they are live. Our regular communication has been key to keeping the models stable, and has also given me an insight into how our data science platform works under the hood. I’ve also started to learn more about software engineering practices, which has been useful for improving my code and making life easier for those who review my PRs.

In the wider data analytics team, it’s been particularly beneficial to keep up-to-date with the work of the data platform team. They are constantly working to improve the quality of data whilst also helping other teams to import new data points independently. It’s helpful to understand the changes that the database is going through, as these changes can often impact our models and can also offer us new features. Keeping up with any issues that arise has also proved vital: for example, data outages can have a large impact on the performance of our production models.

Of course, maintaining an honest relationship with your stakeholders is hugely important for the success of your project. In my case, stakeholders have been really helpful in giving me a better understanding of their business problems, as well as helping to generate opportunities to apply machine learning to other problems in the business.

Testing is Key

It’s essential to understand whether your models are delivering value to the company and this is where testing comes in. My top testing tips are:

  • A/B testing is your friend. Not only should you be using A/B tests to pit a new version of a model against the old version, but you should also implement a control group which is untouched by machine learning. The control group provides a baseline for the business metrics being observed in the test, and its data can also be used for training the model later in order to avoid a feedback loop.
  • Test code should be implemented in the cheapest way possible. The code will probably look horrible, but in the case that the experiment shows your changes have not had the desired effect on your business metrics, it means you’ve wasted as little time as possible writing this code. Of course, in the case that the experiment is successful, you should refactor the code to be production-ready.
  • Remember that many experiments in business will not succeed, but a failed experiment will often teach you more than a successful one, because you have to dig into why the result turned out the way it did.
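As a rough illustration of the first tip, the sketch below splits users into a control group and two model arms, then compares conversion rates between two arms. The arm names, the experiment label and the choice of a two-proportion z-test are my own assumptions for the example, not a description of any particular production setup.

```python
import hashlib
import math

# Hypothetical arm names: "control" receives no machine learning treatment.
ARMS = ("control", "model_old", "model_new")


def assign_arm(user_id: str, experiment: str = "exp_1") -> str:
    """Deterministically map a user to an arm so repeat visits stay consistent."""
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode()).hexdigest()
    return ARMS[int(digest, 16) % len(ARMS)]


def two_proportion_z_test(conv_a: int, n_a: int, conv_b: int, n_b: int):
    """Two-sided z-test for the difference between two conversion rates.

    Returns (z, p_value), where z > 0 means arm B converted better than arm A.
    """
    p_a, p_b = conv_a / n_a, conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:
        raise ValueError("no variation in outcomes; the test is undefined")
    z = (p_b - p_a) / se
    # Two-sided p-value under the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# Made-up counts: conversions and users in the control and new-model arms.
z, p_value = two_proportion_z_test(conv_a=100, n_a=1000, conv_b=150, n_b=1000)
print(f"user_42 is in arm: {assign_arm('user_42')}")
print(f"z={z:.2f}, p={p_value:.4f}")
```

Hashing the user ID rather than drawing a random number on each request keeps a user in the same arm for the whole experiment, which matters if the business metric is measured over several visits.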

No Model is Without its Flaws

With machine learning, it can be difficult to stop yourself from going into an iterative cycle that never ends: just one more feature to try out here, or another run of the hyperparameter tuning job, or an additional tweak to the business logic there. Of course, there has to be a point at which you are satisfied enough with your work to move on to the next step of the project, or to a different project entirely. Here are some tips that I’ve found helpful in avoiding the rabbit hole:

  1. Learn to take a step back and weigh the value of your time against the value of marginal gains in model performance.
  2. Don’t feel pressured to try every single possible feature in the initial build, especially if certain features are complex to compute or difficult to obtain. Remember that you can include these in future iterations of the model once its basic value has been proven.
  3. Continue to monitor the performance of your model even after testing has finished. This sounds obvious but can sometimes be forgotten. You could set up an automated report to make it an effortless task. Make sure to be honest with yourself about what performance is acceptable for the business, and when the performance degrades, make time to investigate why.
  4. Accept that business strategy can and will evolve, and it may be necessary to review your model in the future in order to re-align with the objectives of the business.
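The automated report mentioned in tip 3 does not need to be elaborate. Here is a minimal sketch of the core check, assuming you already log one performance figure per day (recall, for instance). The function name, the baseline and the tolerance are hypothetical and should be agreed with your stakeholders rather than taken from this example.

```python
from statistics import mean


def check_performance(recent_scores, baseline, tolerance=0.05):
    """Flag degradation when the recent average falls more than
    `tolerance` below the agreed baseline."""
    if not recent_scores:
        raise ValueError("recent_scores must not be empty")
    recent_avg = mean(recent_scores)
    return {
        "recent_avg": recent_avg,
        "baseline": baseline,
        "degraded": recent_avg < baseline - tolerance,
    }


# Made-up daily recall values for the last three days.
report = check_performance([0.70, 0.72, 0.71], baseline=0.82)
if report["degraded"]:
    print(f"Model performance dropped to {report['recent_avg']:.2f}, time to investigate")
```

Scheduling a check like this and sending the result to a shared channel is one way to make sure a drop is noticed, for example after a data outage.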

Summary

After only one year in the job I still have plenty of technical skills to learn, but I believe that the lessons I’ve described here will be applicable throughout my career. Many of these are also relevant to other technical roles and relate to qualities that employers are often looking for.

