How to Set Up Your Data Science Stack on a Budget

Mar 11, 2022•6 min read

Companies of all sizes and in all sectors are realizing that to stay competitive, they must adopt a culture in which data-driven decisions are made quickly across operations. That applies at the CEO level, where BI dashboards monitor overall business health and performance; to analysts querying data from numerous sources for insights; and to engineers building machine learning (ML) tools and intelligent automation applications on top of corporate data.

Data science has progressed to the point that no organization can afford to disregard it. The first order of business for any organization looking to establish a data science stack is good budget management. The stack will include the machine learning tools the organization uses to serve its customers better and to gain insights into its own operations that support upper management's decisions.

Since data science and business operations are so closely linked, selecting the correct stack for data architecture is critical. In addition, having the proper tools can help reduce time to market, development expenses, and infrastructure costs, and improve overall platform stability.

The data science tech stack includes the framework for creating models and the runtime for inference jobs. It also encompasses the whole data engineering pipeline, business intelligence tools, and model deployment methods.

This article discusses the key factors to consider when building a data science tech stack on a budget. But before that, here's a quick look at what a data science stack entails.

Overview of a data science stack

In a typical enterprise architecture, data flows into a data lake from multiple sources. A data lake is a heterogeneous storage area where various types of data, including transactional database data, are kept regardless of their format or source. From there, data is extracted, transformed, and loaded into a data warehouse for analysis.
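The flow described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the record layouts, the `orders` table, and the function names are hypothetical, and an in-memory SQLite database stands in for a real data warehouse.

```python
import json
import sqlite3

# Hypothetical raw records as they might land in a data lake:
# JSON lines from two source systems with different layouts.
RAW_LAKE_RECORDS = [
    '{"source": "web", "order_id": "A1", "amount": "19.99", "ts": "2022-03-01T10:00:00"}',
    '{"source": "pos", "id": 7, "total_cents": 4500, "time": "2022-03-01T11:30:00"}',
    '{"source": "web", "order_id": "A2", "amount": "5.00", "ts": "2022-03-02T09:15:00"}',
]


def extract(lines):
    """Parse raw JSON lines into dictionaries."""
    return [json.loads(line) for line in lines]


def transform(record):
    """Map each source-specific layout onto one warehouse schema."""
    if record["source"] == "web":
        return (str(record["order_id"]), "web", float(record["amount"]), record["ts"][:10])
    if record["source"] == "pos":
        return (str(record["id"]), "pos", record["total_cents"] / 100, record["time"][:10])
    raise ValueError(f"unknown source: {record['source']}")


def load(conn, rows):
    """Write the cleaned rows into the warehouse table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS orders "
        "(order_id TEXT, channel TEXT, amount REAL, order_date TEXT)"
    )
    conn.executemany("INSERT INTO orders VALUES (?, ?, ?, ?)", rows)
    conn.commit()


def run_etl(conn, lines):
    """Run extract, transform, and load; return the number of rows loaded."""
    rows = [transform(record) for record in extract(lines)]
    load(conn, rows)
    return len(rows)


if __name__ == "__main__":
    warehouse = sqlite3.connect(":memory:")  # stand-in for a real warehouse
    run_etl(warehouse, RAW_LAKE_RECORDS)
    query = (
        "SELECT order_date, ROUND(SUM(amount), 2) FROM orders "
        "GROUP BY order_date ORDER BY order_date"
    )
    for row in warehouse.execute(query):
        print(row)
```

Once the data sits in one schema, analysts can run descriptive queries like the daily revenue total at the end without caring which source system produced each row.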

Data scientists and business analysts work on the data warehouse, creating reusable analytics modules and reports. Some of these modules use the data warehouse as their data source and generate descriptive insights in batches. Others are tightly coupled with transactional systems and provide real-time results. So that they can be scaled and deployed independently, both kinds of modules are typically served as web services.

How to select components for a data science stack

Deciding which components to include in an analytics and data science stack involves several variables and a significant number of possible combinations. Here are some useful questions to ask before creating a data science stack.

Are you more comfortable with on-premise or cloud-based services?
Do you have the necessary programming skills to build your models and analytics functions?
Have you already invested in a cloud service provider?
Is there a business case for real-time data ingestion and analytics?

With those questions in mind, here are the factors to consider when choosing a component for each key stage of the flow.

1. Data warehouse

The type of data warehouse you choose is mostly determined by whether you want an on-premise or a cloud-based solution. The obvious benefits of a cloud-based software as a service (SaaS) solution are that it is maintenance-free and lets you focus on the core analytics problem without being sidetracked.

The most common on-premise option is an execution engine like Spark or Tez with a querying layer like Hive or Presto on top. The main benefit is that you have total control over your data. Analytics and ML modules can be written as custom code directly in Spark, and querying engines like Presto already include basic ML functions.

If your company lacks the programming ability to maintain such systems and has no plans to acquire it, cloud-based services like Redshift, Azure Data Warehouse, or BigQuery may be a better option. Your team can then use the ML modules that come bundled with these services.

Redshift ML is a newcomer to the market, whereas BigQuery ML has been around for a while. So, if you want to develop ML models directly from your cloud data warehouse, BigQuery ML and Azure ML may be more dependable options than the Amazon Web Services (AWS) offering.

2. ETL (Extract, Transform, Load)

Any machine learning model or analytics module is only as good as the features it receives as input. The ETL tool is the component in charge of generating these input features. For on-premise solutions, common choices are Spark-based transformation functions written as custom code or Spark SQL in Python or Scala.

To make the feature-building process dependable, you'll need to create frameworks and schedulers around it. You can also use an open-source tool like Pentaho Data Integration, although such tools are less versatile than bespoke solutions.
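As a rough illustration of what a bespoke feature-building step with some reliability scaffolding might look like, here is a minimal sketch in plain Python. The field names (`customer_id`, `amount`), the chosen features, and the retry helper are assumptions made for the example; a production pipeline would more likely run on Spark under a real scheduler.

```python
import time
from collections import defaultdict


def build_customer_features(orders):
    """Aggregate raw order rows into one feature row per customer."""
    totals = defaultdict(lambda: {"order_count": 0, "total_spend": 0.0})
    for order in orders:
        stats = totals[order["customer_id"]]
        stats["order_count"] += 1
        stats["total_spend"] += order["amount"]

    features = {}
    for customer_id, stats in totals.items():
        features[customer_id] = {
            "order_count": stats["order_count"],
            "total_spend": stats["total_spend"],
            "avg_order_value": stats["total_spend"] / stats["order_count"],
        }
    return features


def run_with_retries(job, max_attempts=3, delay_seconds=0.0):
    """Run a job, retrying on failure; re-raise after the last attempt."""
    for attempt in range(1, max_attempts + 1):
        try:
            return job()
        except Exception:
            if attempt == max_attempts:
                raise
            time.sleep(delay_seconds)


if __name__ == "__main__":
    # Hypothetical sample rows, as they might come out of the warehouse.
    sample_orders = [
        {"customer_id": "c1", "amount": 10.0},
        {"customer_id": "c1", "amount": 30.0},
        {"customer_id": "c2", "amount": 5.0},
    ]
    print(run_with_retries(lambda: build_customer_features(sample_orders)))
```

The retry wrapper is the simplest form of the "framework" mentioned above: it keeps a transient failure in one run from silently leaving the model without fresh features.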

If you're looking for a SaaS solution, Google Cloud Dataflow, Azure Databricks, and AWS Glue are great options. They support data science modeling natively and can generate code automatically from visual interfaces. A downside is that each is most closely aligned with its own vendor's stack, such as Glue if you are using AWS and Databricks if you are using Azure. Support for data sources outside that vendor's cloud also tends to be weaker.

3. Tools for business intelligence and visualization

Business intelligence and visualization tools are integral to the data science tech stack because they are vital for exploratory data analysis. Tableau and Microsoft Power BI are two popular on-premise options. In addition, Python tools like Seaborn and Matplotlib are effective alternatives for displaying data if your development team needs specific code-based solutions.

In this arena, AWS QuickSight, Google Data Studio, and Azure Data Explorer are all strong SaaS options. AWS QuickSight also offers basic machine learning capabilities, which can be used to detect anomalies, forecast values, and even build dashboards automatically. These services make sense if you're already using the vendor's stack, but they are weaker at integrating data from other sources.

4. Frameworks for ML and analytics implementation

Python has long been the de facto standard for ML and analytics applications based on custom code. Scikit-learn and Statsmodels are popular choices for statistical analysis and modeling. R has many functions for statistical models and may be used in production. Deep learning frameworks such as TensorFlow, MXNet, and PyTorch can also be used.

Deeplearning4j is an excellent alternative if Java is your preferred programming language. Since most developers will need to do a lot of research before they finish the model pipeline, community support is a key factor to consider.

If your company does not want to hire ML experts or design bespoke models, most cloud service providers offer ML models and automated model creation as a service. You can build models and intelligence without writing any code using Azure Machine Learning, Google Cloud AI, AWS Machine Learning Services, and similar offerings.
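The options named in sections 1 to 4 can be collected into a small lookup table, which some teams find handy as a starting checklist. The sketch below is a simplification rather than a recommendation engine: the product names come from this article, while the dictionary keys, the function name, and the rule "if you already use a cloud provider, start with that provider" are assumptions made for the example.

```python
CLOUD_PROVIDERS = ("aws", "google_cloud", "azure")

# Product names as mentioned in this article; keys are illustrative only.
STACK_OPTIONS = {
    "on_premise": {
        "warehouse": "Spark or Tez with Hive or Presto",
        "etl": "Spark custom code, Spark SQL, or Pentaho Data Integration",
        "bi": "Tableau, Microsoft Power BI, Seaborn, or Matplotlib",
        "ml": "Scikit-learn, Statsmodels, R, TensorFlow, MXNet, PyTorch, or Deeplearning4j",
    },
    "aws": {
        "warehouse": "Redshift",
        "etl": "AWS Glue",
        "bi": "AWS QuickSight",
        "ml": "AWS Machine Learning Services",
    },
    "google_cloud": {
        "warehouse": "BigQuery",
        "etl": "Google Cloud Dataflow",
        "bi": "Google Data Studio",
        "ml": "Google Cloud AI",
    },
    "azure": {
        "warehouse": "Azure Data Warehouse",
        "etl": "Azure Databricks",
        "bi": "Azure Data Explorer",
        "ml": "Azure Machine Learning",
    },
}


def candidate_platforms(has_programming_skills, existing_cloud=None):
    """Return the platform keys worth evaluating, most natural fit first.

    Assumed rules (a simplification of the article's advice):
    - an existing cloud investment narrows the search to that provider;
    - without in-house programming skills, only cloud services remain;
    - otherwise every option is open, starting with on-premise.
    """
    if existing_cloud is not None:
        if existing_cloud not in CLOUD_PROVIDERS:
            raise ValueError(f"unknown cloud provider: {existing_cloud}")
        return [existing_cloud]
    if has_programming_skills:
        return ["on_premise", *CLOUD_PROVIDERS]
    return list(CLOUD_PROVIDERS)


if __name__ == "__main__":
    for platform in candidate_platforms(has_programming_skills=False):
        print(platform, STACK_OPTIONS[platform])
```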

5. Stack for deployment

Once the models are created, they must be deployed for real-time or batch inference. If you're using an on-premise configuration, you'll likely wrap the models in a web framework like Flask or Django and ship them in Docker containers. You can then use a container orchestration framework or a load balancer to scale them horizontally. The effort and skill required are obvious decision factors.
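To give a feel for what "wrapping a model in a web service" means, here is a dependency-free sketch. It uses a bare WSGI callable from the Python standard library as a stand-in for Flask or Django (both of which also produce WSGI applications). The `/predict` route, the JSON field names, and the toy linear "model" are all hypothetical.

```python
import json
from wsgiref.simple_server import make_server


def predict(features):
    """Hypothetical stand-in model: a fixed linear score over two inputs."""
    return 0.5 * features["x1"] + 2.0 * features["x2"]


def app(environ, start_response):
    """WSGI application exposing POST /predict with a JSON body."""
    json_header = ("Content-Type", "application/json")
    if environ.get("REQUEST_METHOD") != "POST" or environ.get("PATH_INFO") != "/predict":
        start_response("404 Not Found", [json_header])
        return [b'{"error": "not found"}']
    try:
        length = int(environ.get("CONTENT_LENGTH") or 0)
        payload = json.loads(environ["wsgi.input"].read(length))
        body = json.dumps({"prediction": predict(payload)}).encode("utf-8")
        status = "200 OK"
    except (ValueError, KeyError, TypeError):
        # Malformed JSON, missing fields, or non-numeric values.
        body = b'{"error": "expected JSON with numeric x1 and x2"}'
        status = "400 Bad Request"
    start_response(status, [json_header, ("Content-Length", str(len(body)))])
    return [body]


def serve(host="0.0.0.0", port=8000):
    """Start a development server; blocks until interrupted."""
    with make_server(host, port, app) as server:
        server.serve_forever()
```

Calling `serve()` starts the standard library's development server, which is fine for local testing only. Inside a container you would typically run the same `app` under a production WSGI server and put the load balancer or orchestrator in front of several replicas.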

Inference modules are complex and require careful use of techniques like batching and threading to get optimal performance. However, ML frameworks like TensorFlow, MXNet, and PyTorch have their own deployment mechanisms, which you should use rather than reinventing the wheel.
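For readers unfamiliar with these terms, the sketch below shows the basic idea of batching and threading in plain Python. It is a teaching example, not a substitute for a framework's serving layer: `predict_batch` is a hypothetical model call, and the batch size and worker count are arbitrary.

```python
from concurrent.futures import ThreadPoolExecutor


def batched(items, batch_size):
    """Yield consecutive slices of at most batch_size items."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


def predict_batch(rows):
    """Hypothetical model call that scores many rows in one invocation."""
    return [sum(row) for row in rows]


def predict_all(rows, batch_size=32, workers=4):
    """Score all rows, one model call per batch, preserving input order."""
    batches = list(batched(rows, batch_size))
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Executor.map returns results in the order the batches were submitted.
        results = list(pool.map(predict_batch, batches))
    return [prediction for batch_result in results for prediction in batch_result]


if __name__ == "__main__":
    sample_rows = [[1, 2], [3, 4], [5, 6], [7, 8], [9, 10]]
    print(predict_all(sample_rows, batch_size=2, workers=2))
```

Batching amortizes the fixed cost of each model call across many inputs. Threads only help when the model call releases Python's global interpreter lock or waits on I/O, as native framework code often does; with a pure-Python model like this toy one, they add no speed.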

Using the ML serving choices supplied by cloud services is one way to avoid the difficult deployment procedure. AWS, GCP, and Azure feature built-in deployment mechanisms for their ML services as well as the ability to deploy bespoke models produced outside their platforms. The primary benefit of employing such services is that scalability is automatic.

As you can see, many things are taken into account when setting up a data science stack with a limited budget. Much depends on the size of the company as well as which routes it wants to take based on its unique needs.