Solving the 80/20 Data Science Dilemma

WHAT IT IS, WHY IT’S NOT WORKING, AND HOW TO FIX IT

The 80/20 data science rule is the widely accepted belief that data scientists spend 80% of their time finding, cleaning, and organizing data and only the remaining 20% analyzing the data, developing algorithms, and building machine learning models. That split leads to significant inefficiencies throughout the data science industry.

Data analysis as a profession didn’t even exist 10 years ago, but thanks to the rise of cloud it has been one of the top-ranking positions in the United States for two years running. It has some of the highest salaries, job-opening counts, and job satisfaction rates in the country. The Harvard Business Review called data scientist the “sexiest job of the 21st century”. And it’s only expected to grow from here.

In 2020, demand is expected to grow by up to 2.7 million job openings. By 2026, the U.S. Bureau of Labor Statistics predicts a further 11.5 million new jobs including data engineers, data administrators, machine learning engineers, statisticians, and data and analytics managers.

The number of data science positions far exceeds the number of data scientists, so it’s imperative that the 80/20 data science dilemma be addressed and remedied to improve efficiency, which requires the right cloud tools.

Gathering and Analyzing Data

One of a data scientist’s duties is to identify relevant data sets within data lakes, the repositories where an organization stores its data. Data scientists are often also responsible for data-sharing policy because effective corporate-level policy is lacking. Data comes in from a multitude of streams via cloud-connected systems, and in most organizations, there’s no easy way to sift through it and determine what’s relevant or safe to share. Too often, data scientists are left waiting weeks for various internal departments to deliver requested data, only to find that the departments don’t have it or that the quality is low.

Once they finally receive the data, they need to analyze it. Depending on the format, there may be insufficient metadata, so the analyst will need to contact the data owner. Then it’s time to format, clean, and sample the data, and in some cases perform scaling, decomposition, and aggregation transformations, before model training can start.
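
As a rough illustration of the preparation steps described above, the sketch below cleans, scales, and aggregates a handful of records using only the Python standard library. The record fields (region, sales) and the helper names are hypothetical, and min-max scaling is only one of several possible scaling choices.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical raw records; None marks a missing value.
raw = [
    {"region": "east", "sales": 120.0},
    {"region": "east", "sales": None},
    {"region": "west", "sales": 80.0},
    {"region": "west", "sales": 100.0},
]

def clean(records):
    """Drop records whose sales value is missing."""
    return [r for r in records if r["sales"] is not None]

def min_max_scale(values):
    """Rescale values linearly to the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # All values identical: avoid dividing by zero.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

def aggregate_mean(records, key, field):
    """Average `field` within each group defined by `key`."""
    groups = defaultdict(list)
    for r in records:
        groups[r[key]].append(r[field])
    return {k: mean(v) for k, v in groups.items()}

cleaned = clean(raw)
scaled = min_max_scale([r["sales"] for r in cleaned])
by_region = aggregate_mean(cleaned, "region", "sales")
```

Even in this toy form, each step is a separate manual pass over the data, which is where much of the 80% goes.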

Another issue in analysis efficiency is the organizational structure. Data scientists often perform their work as an isolated task, which backs up workloads, consumes resources, and increases the risk of error. A better approach would be to utilize cloud platforms to create a unified operational structure with automated data governance. This would increase efficiency and allow data scientists to collaborate with each other and with their developers.

Creating Data Models

Generally, the more data a model is exposed to, the more accurate the model is, so while the processes of finding and analyzing the data are tedious, they are essential. It’s best for a data scientist to gather and analyze as much data as possible when forming a model. Unrealistic deadlines can force data scientists to compromise the model with sub-optimal data that produces sub-optimal results. Errors in model development can cause significantly different output and render the model useless.

To produce a good model within the given time constraints, data scientists are typically only able to develop one model at a time. If there’s an error, they have to start over, setting back the entire backlog of models to be developed. Simply put, data scientists are expected to produce accurate data models using incomplete and low-quality data on short timelines, and that is just not realistic.

How to Improve Data Analysis Efficiency

Cloud data services can automate many of the tedious and time-consuming processes that data scientists have to go through before getting to the analysis part of data analysis. Using these services improves efficiency without compromising the quality of the data. This means analysts can produce accurate data models that can serve as working foundations for AI and cognitive applications.

Gathering Data

Intelligent search capabilities with metadata like tags and metrics can help data scientists find the data they need and determine if it’s relevant and valuable for a specific model. Data governance tools eliminate the need for data scientists and analysts to be responsible for data-sharing policy, giving them confidence that they have permission to use the data sets they work with. These tools also help ensure that the models and results will be used responsibly.
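
To make the idea of metadata-driven search concrete, here is a minimal sketch of a data catalog that filters on tags, a quality metric, and a sharing flag. The DatasetEntry class, its fields, and the sample entries are all assumptions made for illustration; they do not describe any particular product’s catalog.

```python
from dataclasses import dataclass, field

@dataclass
class DatasetEntry:
    """Hypothetical catalog record for one data set."""
    name: str
    tags: set = field(default_factory=set)
    completeness: float = 0.0  # share of non-missing values, 0..1
    shareable: bool = False    # set by governance policy, not by the analyst

def find_datasets(catalog, required_tags, min_completeness=0.0):
    """Return names of shareable data sets that carry all required tags."""
    required = set(required_tags)
    return [
        d.name
        for d in catalog
        if d.shareable
        and required <= d.tags
        and d.completeness >= min_completeness
    ]

catalog = [
    DatasetEntry("crm_accounts", {"customer", "sales"}, 0.95, True),
    DatasetEntry("web_clicks", {"customer", "behavior"}, 0.60, True),
    DatasetEntry("payroll", {"employee", "finance"}, 0.99, False),
]

# Customer data that is both shareable and at least 90% complete.
matches = find_datasets(catalog, {"customer"}, min_completeness=0.9)
```

Because the shareable flag is maintained by policy rather than by the person searching, the analyst only ever sees data sets that are cleared for use.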

Training Models

These automations will free up a significant portion of the data analyst’s time, allowing him or her to train several models at once. In addition to being more efficient, this freedom will facilitate experimentation and innovation by eliminating the need to laser-focus resources on a single approach that may not end up being valuable.
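
The following sketch shows one way training several models at once might look in miniature: two candidate models are fitted concurrently and then compared by error. The toy data, the two model types, and the use of a thread pool are illustrative assumptions. A real comparison would score on held-out data, and heavier training jobs would more likely run as separate processes or cloud jobs.

```python
from concurrent.futures import ThreadPoolExecutor
from statistics import mean

# Toy data that follows y = 2x + 1 exactly.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]

def fit_mean(xs, ys):
    """Baseline model: always predict the average target."""
    avg = mean(ys)
    return lambda x: avg

def fit_line(xs, ys):
    """Ordinary least squares for a single feature."""
    mx, my = mean(xs), mean(ys)
    covariance = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    variance = sum((x - mx) ** 2 for x in xs)
    slope = covariance / variance
    intercept = my - slope * mx
    return lambda x: slope * x + intercept

def mse(model, xs, ys):
    """Mean squared error of a fitted model."""
    return mean((model(x) - y) ** 2 for x, y in zip(xs, ys))

def train_and_score(item):
    """Fit one candidate and return its name with its error."""
    name, fit = item
    model = fit(xs, ys)
    return name, mse(model, xs, ys)

candidates = {"mean_baseline": fit_mean, "linear": fit_line}

# Train every candidate concurrently instead of one at a time.
with ThreadPoolExecutor() as pool:
    scores = dict(pool.map(train_and_score, candidates.items()))

best = min(scores, key=scores.get)
```

If one candidate turns out to be a dead end, the others have already been trained, so nothing has to start over.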

Transfer Learning

Transfer learning is the preservation of knowledge gained in one endeavor and its application to a related endeavor. It’s a popular topic in the machine learning industry. Cloud platforms can enable data scientists to save and extend models and use existing assets to frame new projects so that each project doesn’t have to be built from nothing.
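
A very small sketch of the save-and-extend idea follows. It uses warm-starting, a simplified stand-in for transfer learning: a linear model is trained on one task, saved as JSON, then reloaded and briefly fine-tuned on a related task. The tasks, the train helper, and the parameter values are assumptions chosen for illustration. Transfer learning in practice usually reuses parts of much larger models.

```python
import json

def train(xs, ys, w=0.0, b=0.0, lr=0.05, steps=200):
    """Fit y = w*x + b by batch gradient descent, starting from (w, b)."""
    n = len(xs)
    for _ in range(steps):
        grad_w = sum((w * x + b - y) * x for x, y in zip(xs, ys)) * 2 / n
        grad_b = sum((w * x + b - y) for x, y in zip(xs, ys)) * 2 / n
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

def mse(w, b, xs, ys):
    """Mean squared error of the line (w, b) on the given data."""
    return sum((w * x + b - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

# Source task, trained with a generous budget: y = 2x + 1.
src_x = [0.0, 1.0, 2.0, 3.0]
src_y = [1.0, 3.0, 5.0, 7.0]
w_src, b_src = train(src_x, src_y, steps=500)

# Save the trained model so a later project can reuse it.
saved = json.dumps({"w": w_src, "b": b_src})

# Related target task, y = 2x + 1.5, with only a few training steps.
tgt_x = [0.0, 1.0, 2.0, 3.0]
tgt_y = [1.5, 3.5, 5.5, 7.5]
params = json.loads(saved)

warm = train(tgt_x, tgt_y, w=params["w"], b=params["b"], steps=3)
cold = train(tgt_x, tgt_y, steps=3)  # same budget, built from nothing

warm_error = mse(*warm, tgt_x, tgt_y)
cold_error = mse(*cold, tgt_x, tgt_y)
```

With the same small training budget, the model that starts from the saved parameters ends with a lower error on the new task than the one that starts from zero.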

Visualization

Data science tools can be used to develop visualizations that help to communicate how models work. These visualizations save time that would otherwise be spent explaining the utility of models, and they reduce risk.

The 90/10 Data Science Model

Data scientists are essential for innovation and competitive edge in today’s digital era, but as the demand for data analysis increases, the responsibility on the shoulders of our existing data scientists continues to be too heavy to allow for efficient performance.

By equipping your data science teams with japio cloud data tools, you can automate many of the processes currently performed manually. Together, we can eliminate the 80/20 data science dilemma and introduce a 90/10 efficiency model in which data analysts spend 90% of their time carrying out their more enjoyable and productive duties.
