
Five Engineering Skills Every Data Scientist Should Learn


Rounding out tactics to help you stay competitive as a “full stack” data scientist

Title card created by the author

As somebody who enjoys mentoring people to their fullest potential, I’ve had the sincere pleasure of mentoring many undergraduate students majoring in data science. What astounds me is how few engineering skills are taught as part of these programs. From students at state schools to students at Ivy League universities, I constantly hear that the emphasis is placed on pure data science skills. While those skills aren’t wrong by any means, this emphasis leaves a gaping hole when it comes to turning a data scientist into a “full stack” data scientist.

By “full stack”, I don’t necessarily mean things like learning web development. What I mean specifically is being able to make your predictive model usable in a production setting. It’s one set of skills to know how to build the model; it’s another to know how to make it usable by others!

Fortunately, in my opinion, these skills are easier to learn than pure data science work itself. You don’t need to be an expert in any of them, but having a foundational level of knowledge is important nevertheless. Depending on the company you end up working for as a data scientist, there may very well be an expectation that you know these basic engineering skills.

To help facilitate your learning journey, we’ll cover the basic gist of what these engineering skills are, and I’ll also provide links to learning resources to help you upskill in each of these areas! I’ll be sure to help you understand how each of these skills relates to your data science work. Also, if you struggle to understand any concept, I would highly recommend an LLM service like ChatGPT or Perplexity. Be careful not to use these services as a crutch, but if you don’t understand something, LLMs can be extremely useful in facilitating your learning.

1. Basic Linux commands

Starting this off at number one for a reason: while most consumer computers run on Windows or macOS, virtually every production server runs on some form of Linux. You may have already gotten some exposure here, since tasks like running Python from a terminal rely on the same command-line basics. Pretty much every other skill in this post also relies on understanding basic Linux commands. Fortunately, you don’t need to be an expert in Linux to be effective in your data science work. The basics will do just fine!
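
To give you a taste, here’s a small sampler of the kinds of commands you’ll use day to day (the file and folder names are just hypothetical placeholders):

    pwd                           # print the current working directory
    ls -lh data/                  # list files with human-readable sizes
    cd ~/projects/my-model        # move into a (hypothetical) project folder
    head -n 5 train.csv           # peek at the first rows of a dataset
    wc -l train.csv               # count how many rows the dataset has
    grep "ERROR" logs/app.log     # search a log file for errors
    mkdir -p outputs/models       # create nested directories in one step
    cp model.pkl outputs/models/  # copy a trained model artifact
    chmod +x run_pipeline.sh      # make a script executable

If you can navigate directories, inspect files, and search logs comfortably, you’re most of the way there.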

Learning Resources:

2. Dependency management with virtual environments

You’ve probably heard the phrase, “Well, it runs on my computer just fine!” When completing your data science work in a language like Python, you may or may not be aware that every library you use comes at a specific version. Even if you are running the latest version of something today, that version will most likely become outdated. New releases of libraries like Pandas or Scikit-Learn may even introduce breaking changes, so you’ll want to ensure that you are managing your dependencies correctly.

The most effective way to do this is by creating a virtual environment on your computer. While there are many tools out there for creating virtual environments, I would recommend sticking with either venv or conda. Both are very common and well documented across the internet. You may also want to consider documenting your dependencies in a small requirements.txt file. Understanding virtual environments will also play well into our next skill…
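
As a minimal sketch of that workflow using venv (the pinned version numbers below are purely illustrative):

    python3 -m venv .venv                          # create the environment
    source .venv/bin/activate                      # activate it (Linux/macOS)
    pip install pandas==2.2.2 scikit-learn==1.5.1  # install pinned versions
    pip freeze > requirements.txt                  # record exact dependencies

    # Later, on another machine, recreate the same environment:
    pip install -r requirements.txt

The requirements.txt file is what lets a teammate (or a production server) reproduce exactly what “runs on my computer”.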

Learning Resources:

3. Docker

When it comes to deploying machine learning predictive models into production, most people these days make use of Docker containers. The simplest way to understand Docker is that it is a specialized way to manage dependencies, similar to the venv and conda tools from the previous skill. There are differences between Docker and those tools (a container packages the operating system layer and system dependencies, not just your Python libraries), but you definitely don’t need to understand what Docker is doing under the hood. (Technically speaking, you can use Docker as a virtual environment on your local computer, too, but most IDEs generally don’t support using Docker this way.)
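
Here’s a minimal sketch of what that looks like in practice, written as shell commands and assuming a hypothetical serve.py app that listens on port 8000:

    # Write a minimal Dockerfile for a (hypothetical) model-serving app
    cat > Dockerfile <<'EOF'
    FROM python:3.11-slim
    WORKDIR /app
    COPY requirements.txt .
    RUN pip install --no-cache-dir -r requirements.txt
    COPY . .
    CMD ["python", "serve.py"]
    EOF

    # Build the image and run it as a container, exposing port 8000
    docker build -t my-model:latest .
    docker run -p 8000:8000 my-model:latest

Notice how the Dockerfile reuses the requirements.txt file from the previous skill. From there, the same image that runs on your laptop can run on a production server unchanged.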

Learning Resources:

4. Infrastructure as Code (e.g. Terraform, OpenTofu)

Many companies these days are choosing not to maintain their own “on premises” infrastructure in favor of cloud services like Amazon Web Services (AWS), Microsoft Azure, or Google Cloud Platform (GCP). While it is totally possible to deploy your predictive models through each service’s user interface (UI), this is not an ideal way of working. Doing things in the UI is time-consuming and not very replicable. Each of these services offers its own “infrastructure as code” (IaC) option, but many teams choose a more universal IaC tool like Terraform or OpenTofu instead. By leveraging IaC, you can reuse this code across multiple modeling efforts without having to “re-invent the wheel” every time.
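
As a tiny sketch of what Terraform looks like, here is a configuration that provisions a single AWS S3 bucket to hold model artifacts. This assumes AWS credentials are already configured, and the bucket name is a hypothetical placeholder that would need to be globally unique:

    # Define a small piece of AWS infrastructure in main.tf
    cat > main.tf <<'EOF'
    provider "aws" {
      region = "us-east-1"
    }

    # An S3 bucket to hold trained model artifacts
    resource "aws_s3_bucket" "model_artifacts" {
      bucket = "my-model-artifacts-bucket"  # must be globally unique
    }
    EOF

    terraform init   # download the AWS provider
    terraform plan   # preview what will change
    terraform apply  # create the bucket

Because the infrastructure is just code in a file, you can version it in Git and reuse it for your next model with a few small edits.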

Learning Resources:

5. CI/CD

Standing for “continuous integration / continuous delivery”, CI/CD answers the question of, “How do I get my predictive model from my local computer to a production server?” I list this one last because it pulls together all the skills we’ve covered so far. For example, you may have a CI/CD stage that builds your Docker container followed by another that deploys the Docker container to a cloud service using IaC. Of course, CI/CD isn’t limited to just these tactics. While we won’t cover them in this post, you may also want to include CI/CD stages for things like unit testing or code security scanning.

There are many different flavors of CI/CD tools out there, but fortunately, they all operate the same way conceptually. The differences come down to specific syntax, so if you understand how one CI/CD tool works, you can quickly pick up the syntax of another. Two of the most popular options are GitHub Actions and GitLab CI/CD.
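
As an illustrative sketch (not a production-ready pipeline), here is a minimal GitHub Actions workflow that ties the previous skills together; the step names are assumptions, and cloud credential setup is omitted for brevity:

    # Save a minimal workflow file in your repository
    mkdir -p .github/workflows
    cat > .github/workflows/deploy.yml <<'EOF'
    name: build-and-deploy
    on:
      push:
        branches: [main]

    jobs:
      deploy:
        runs-on: ubuntu-latest
        steps:
          - uses: actions/checkout@v4

          # Stage 1: build the Docker container for the model
          - name: Build Docker image
            run: docker build -t my-model:${{ github.sha }} .

          # Stage 2: deploy the infrastructure with Terraform
          # (cloud credentials would need to be configured first)
          - name: Deploy with Terraform
            run: |
              terraform init
              terraform apply -auto-approve
    EOF

Every push to main would then build the image and roll out the infrastructure automatically, with no manual clicking in a UI.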

Learning Resources:

That wraps up the skills for this post! Of course, you could certainly go beyond these to learn even more things like unit testing and code scanning, but I would encourage you to stick to learning these skills first. As a reminder: depending on the company, these skills are actual expectations for many data scientist positions! By upskilling in these areas, you solidify yourself as a competitive candidate for data scientist roles.

