Imagine you’re tasked to bake a wedding cake. Your client has a clear vision: a two-tier cake, each tier 10 cm high, with a fluffy texture and a vibrant pink color. You want to get it right but don’t know the quantities of flour, eggs, and sugar needed to make each tier match the description. You’re also unsure about the baking temperature and how long the cake needs in the oven. Simply put, you don’t have a recipe for the cake.
Now let’s look at this challenge from a different perspective. Chemical companies face similar issues during the manufacturing process. Here is an example. Our client Polygal AG is a chemical enterprise manufacturing hydrocolloids with 100% renewable and plant-based raw materials. Each product they produce has a unique “recipe,” involving specific ingredients (chemical compounds) and precise configurations. These “recipes” are usually created by specialists such as chemists or technicians, who spend a significant amount of time working on each one.
But what if we passed this task to AI? Can we build an AI model to predict the production settings and exact amounts of the ingredients required to meet Polygal’s product specifications? And if so, what kind of AI do we need to do that?
LLMs like GPT-4 are primarily designed for natural language processing tasks, such as understanding, interpreting, and generating human language. They are not the best fit for handling structured numerical data or performing regression analysis, which is precisely the kind of problem we're looking to solve for Polygal.
Predicting exact ingredient quantities and production settings based on product configurations is a regression problem or, to be more specific, a Multivariate Multiple Regression (MMR) problem. It involves forecasting two or more outcomes (dependent variables) based on multiple input features (independent variables).
We believe that the MMR prediction model would be a better solution to meet Polygal’s requirements. Here are some details about the approach we would take to develop it.
For Polygal, our objective is to implement an MMR prediction model that would help the company predict production configurations and the amounts of chemical compounds needed to produce hydrocolloids based on product specifications.
With the goals clearly outlined, the next step is to collect and prepare the data.
Like many enterprises in the chemical manufacturing industry, Polygal has a vast amount of data, which is stored in Excel spreadsheets and databases. This data contains detailed product characteristics and performance attributes of chemical substances. We have to gather and label this information, including examples of both successful and unsuccessful recipes, along with the specific conditions under which they were produced. The more data you gather, the better the model will perform.
If you are looking to build a similar MMR solution, your dataset should include both independent variables (the input features, such as product specifications) and dependent variables (the outcomes you want to predict, such as ingredient amounts and production settings).
As enterprise data might include errors and typos and often comes in various formats, it can’t be used as is and has to be pre-processed. To clean the dataset, we can apply processes such as feature engineering and feature reduction. Here is how they work to enhance model performance.
Feature engineering is the process of transforming raw data into a machine-readable format. It involves such techniques as feature imputation, feature encoding, and feature scaling.
Feature imputation is a machine learning method that involves filling in missing values in a dataset to make it usable for training models. It addresses incomplete data, a frequent issue in real-world datasets.
A common way to use feature imputation is mean imputation. This method involves replacing missing values in a numerical feature with the mean (average) value of that feature calculated from the observed (non-missing) data.
You can also apply more sophisticated methods like k-nearest neighbors, where the value of a missing data point is calculated based on the values of its closest neighbors in the dataset.
If imputation is not feasible, you should remove records with excessive missing values.
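To make this concrete, here is a minimal sketch of these imputation options using pandas and scikit-learn. The column names and values are hypothetical stand-ins for Polygal's features, not real data.

```python
# A minimal imputation sketch; column names and values are illustrative only.
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer, KNNImputer

df = pd.DataFrame({
    "raw_material_a_kg": [12.0, np.nan, 11.5, 13.2],
    "reaction_temp_c":   [80.0, 82.5, np.nan, 79.0],
})

# Mean imputation: replace each missing value with the column's mean.
df_mean = pd.DataFrame(SimpleImputer(strategy="mean").fit_transform(df), columns=df.columns)

# KNN imputation: estimate each missing value from the two most similar rows.
df_knn = pd.DataFrame(KNNImputer(n_neighbors=2).fit_transform(df), columns=df.columns)

# If imputation is not feasible, drop rows that are mostly empty
# (here: rows with fewer than half of their values present).
df_dropped = df.dropna(thresh=df.shape[1] // 2)
```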
Real-world data often includes categorical features, like colors, labels, or types, while ML algorithms require numerical input. To create numerical features from categorical variables that the model can interpret and process effectively, we can use one-hot encoding. It’s the process of converting categorical values into a series of binary columns: each unique category becomes a new column, with a 1 or 0 indicating the presence of that category.
Suppose we have three unique categories, like “dark brown,” “light brown,” and “golden.” We create three columns with binary values indicating the presence of each color: “dark brown” is encoded as [1, 0, 0], “light brown” as [0, 1, 0], and “golden” as [0, 0, 1].
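Here is a minimal sketch of that encoding with scikit-learn’s OneHotEncoder. We pass the category order explicitly so the columns match the example above (note that the sparse_output argument requires scikit-learn 1.2 or later).

```python
# A minimal one-hot encoding sketch for the three color categories.
import numpy as np
from sklearn.preprocessing import OneHotEncoder

colors = np.array([["dark brown"], ["light brown"], ["golden"]])

encoder = OneHotEncoder(
    categories=[["dark brown", "light brown", "golden"]],
    sparse_output=False,
)
encoded = encoder.fit_transform(colors)
# encoded:
# [[1. 0. 0.]   -> dark brown
#  [0. 1. 0.]   -> light brown
#  [0. 0. 1.]]  -> golden
```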
Feature scaling is a preprocessing technique used to transform feature values to a similar scale. For example, Polygal’s dataset includes features such as raw material A quantity, raw material B quantity, reaction temperature, and reaction time, each measured in different units.
All these features are on different scales, which might lead to issues where features with larger values disproportionately affect the prediction. To avoid these issues, we can use normalization and standardization techniques. Normalization (also known as min-max scaling) adjusts the range of feature values so that they fall between a specified minimum and maximum, typically 0 and 1. Standardization, on the other hand, scales numerical values to have a mean of 0 and a standard deviation of 1.

Along with statistical techniques like calculating minimums and maximums in feature engineering, it’s essential to understand the business context and consult subject matter experts when required. For example, chemists and technicians at Polygal can offer crucial insights that our statistical analysis might miss, ensuring that the engineered features capture relevant information.
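Here is a minimal sketch of both scaling approaches with scikit-learn; the feature values are illustrative, not real Polygal data.

```python
# A minimal scaling sketch: each row is one production run with
# raw material A, raw material B, reaction temperature, and reaction time.
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

X = np.array([
    [120.0, 15.0, 85.0, 3.5],
    [200.0, 22.0, 90.0, 4.0],
    [150.0, 18.0, 78.0, 2.5],
])

# Normalization (min-max scaling): rescale each feature into the [0, 1] range.
X_normalized = MinMaxScaler().fit_transform(X)

# Standardization: rescale each feature to mean 0 and standard deviation 1.
X_standardized = StandardScaler().fit_transform(X)
```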
Feature reduction, also known as dimensionality reduction, is a technique to reduce the number of input variables in a dataset while preserving as much information as possible.
This process is important because high-dimensional data can be challenging to work with, often leading to problems such as overfitting, where a model learns the noise in the training data instead of the actual patterns. The goal here is to distinguish between essential predictors and irrelevant metadata that may not significantly impact the model’s predictions. Techniques such as feature selection and feature extraction help achieve this objective.
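As an illustration, here is a minimal sketch of both techniques with scikit-learn, using a synthetic dataset in place of Polygal’s real one.

```python
# A minimal feature reduction sketch on synthetic data.
from sklearn.datasets import make_regression
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest, f_regression

# Synthetic stand-in: 30 input features, of which only 10 carry signal.
X, y = make_regression(n_samples=200, n_features=30, n_informative=10, random_state=0)

# Feature selection: keep the 10 features most strongly related to the target.
X_selected = SelectKBest(score_func=f_regression, k=10).fit_transform(X, y)

# Feature extraction: project onto components explaining 95% of the variance.
X_extracted = PCA(n_components=0.95).fit_transform(X)
```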
Once the data is preprocessed, we start building the model by splitting the data into three sets: the training set, the validation set, and the test set. Each set serves a specific purpose in training, tuning, and evaluating the model’s performance.
The model uses the training set to learn the patterns, relationships, and features from the data to make future predictions. It allows the model to adjust its parameters based on the examples it sees.
The validation set, in turn, is used to understand how the model performs and to tune its hyperparameters to improve its performance. It also acts as an unbiased dataset to compare the performance of different algorithms trained on the same training set.
Lastly, the test set is used to evaluate the final model’s performance on unseen data.
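Here is a minimal sketch of a 70/15/15 split with scikit-learn, using a synthetic multi-output dataset as a stand-in for Polygal’s data.

```python
# A minimal train/validation/test split sketch (70/15/15).
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split

# Synthetic stand-in: 10 specification features and 3 targets
# (e.g. two ingredient amounts and one production setting).
X, Y = make_regression(n_samples=500, n_features=10, n_targets=3, noise=0.1, random_state=42)

# First hold out 30% of the data, then split that portion
# half-and-half into validation and test sets.
X_train, X_temp, Y_train, Y_temp = train_test_split(X, Y, test_size=0.30, random_state=42)
X_val, X_test, Y_val, Y_test = train_test_split(X_temp, Y_temp, test_size=0.50, random_state=42)
```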
Before diving into model training, also known as model or algorithm fitting, we need to choose the algorithm best suited to our specific problem.
Choosing the right algorithm depends on the nature of the problem. For instance, classification tasks often use algorithms like Gradient Boosting Classifier or Random Forest Classifier, while multiple regression problems might employ methods such as Multiple Linear Regression, Ridge Regression, or Gradient Boosting Regression.
For the "recipe" use case we're exploring, multiple regression algorithms which predict outcomes using various input variables work best. But we always recommend selecting a few different algorithms and training them on the prepared dataset before making a decision which algorithm to go with.
After training, we need to evaluate how well the model performs using the known outputs from the validation set. It helps determine if the model is making accurate predictions or if it needs adjustments. Depending on the results, we can fine-tune its performance through hyperparameter tuning. For example, if the model is learning too slowly, we might increase the learning rate. This iterative process continues until acceptable results are achieved.
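As an example of hyperparameter tuning, here is a minimal grid search sketch over the gradient boosting candidate from above; the parameter values are illustrative, not tuned for Polygal’s data.

```python
# A minimal hyperparameter tuning sketch with a cross-validated grid search.
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import GridSearchCV
from sklearn.multioutput import MultiOutputRegressor

param_grid = {
    "estimator__learning_rate": [0.01, 0.05, 0.1],
    "estimator__n_estimators": [100, 300],
    "estimator__max_depth": [2, 3, 4],
}

search = GridSearchCV(
    MultiOutputRegressor(GradientBoostingRegressor(random_state=42)),
    param_grid,
    scoring="neg_mean_absolute_error",
    cv=3,
)
search.fit(X_train, Y_train)
print("Best parameters:", search.best_params_)
```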
Beware of overfitting. It occurs when the model becomes too tailored to the training data, capturing noise instead of true patterns. In other words, the model might perform exceptionally well on the training data but struggle with new data because it has not learned to generalize the underlying patterns. Think of it like a student who memorizes a page from a textbook but fails to apply those concepts in a slightly different context.
To prevent overfitting, we analyze the performance of our machine learning model on both the training data and the validation data. If the model shows very high accuracy on the training data but significantly lower accuracy on the validation data, it’s a strong indicator of overfitting. The model has learned to perform well on the training data but fails to generalize on the unseen data. For this reason, we advise separating the test dataset to evaluate the model’s performance objectively, without further tweaking of hyperparameters.
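Continuing the sketch above, the overfitting check boils down to comparing the model’s score on the training data against its score on the validation data.

```python
# A minimal overfitting check: compare training vs. validation performance.
best_model = search.best_estimator_

train_r2 = best_model.score(X_train, Y_train)
val_r2 = best_model.score(X_val, Y_val)

# A large gap (e.g. 0.99 on training vs. 0.70 on validation) suggests the
# model is memorizing the training data rather than generalizing.
print(f"Training R^2: {train_r2:.3f}, validation R^2: {val_r2:.3f}")
```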
Evaluating the model performance typically involves specific metrics, often referred to as Key Performance Indicators (KPIs). For regression problems, we can use metrics like Mean Absolute Error (MAE), Mean Squared Error (MSE), Root Mean Squared Error (RMSE), R-squared, and more. The error metrics quantify the difference between the predicted values and the actual values, with lower values indicating better performance, while a higher R-squared indicates that the model explains more of the variance in the data.
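Here is a minimal sketch of computing these KPIs on the validation set with scikit-learn; for multiple outputs, the metrics are averaged across the target columns by default.

```python
# A minimal evaluation sketch using common regression KPIs.
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

Y_pred = best_model.predict(X_val)

mae = mean_absolute_error(Y_val, Y_pred)
mse = mean_squared_error(Y_val, Y_pred)
rmse = np.sqrt(mse)
r2 = r2_score(Y_val, Y_pred)

print(f"MAE: {mae:.3f}, MSE: {mse:.3f}, RMSE: {rmse:.3f}, R^2: {r2:.3f}")
```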
After evaluating models with predefined metrics, we have to compare model performance and choose the one that performs best on the test dataset. This model then becomes the basis for practical applications at Polygal.
Once the best-performing model is selected, we should save it in a format that can be easily integrated into Polygal’s manufacturing process. Most machine learning libraries provide formats for saving models, allowing them to be loaded and used later with minimal effort.
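For scikit-learn models, this can be as simple as the joblib sketch below; the file name is a hypothetical example.

```python
# A minimal model persistence sketch with joblib.
import joblib

# Persist the selected model to disk.
joblib.dump(best_model, "polygal_mmr_model.joblib")

# Later, in the production environment, load it and make a prediction.
loaded_model = joblib.load("polygal_mmr_model.joblib")
predicted_recipe = loaded_model.predict(X_test[:1])
```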
There are a couple of ways to deploy the model, such as using a managed service like Amazon SageMaker or building a custom Python backend API. Let’s see how they differ.
Amazon SageMaker is a fully managed service allowing developers and data scientists to build, train, and deploy ML models at scale. It simplifies the ML workflow by providing a range of tools and features designed to streamline the process from data preparation to model deployment.
While you pay only for what you use with SageMaker, it may incur additional costs, especially for resource-intensive models like GPT. SageMaker offers two pricing options: On-Demand Pricing, with no minimum fees or upfront commitments, and Savings Plans, which offer a flexible, usage-based pricing model but require you to commit to a certain amount of usage over a period of time, typically one or three years.
Another option is to build a custom Python backend API responsible for processing input data, passing it to the model, and returning predictions. Once the API is developed, there are two options to follow. The first one is to embed the API directly into the application’s backend service, which allows for seamless interaction without relying on external services. This approach is particularly well-suited for scenarios where the model is relatively lightweight and doesn’t require significant computational resources.
Alternatively, you can deploy the API as a separate microservice, allowing it to operate independently from the main application. Each microservice can handle specific tasks, facilitating easier management and updates.
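Here is a minimal sketch of such a prediction API, assuming FastAPI and the joblib file from the previous sketch; the endpoint name and field names are hypothetical.

```python
# A minimal prediction API sketch (run with: uvicorn main:app).
import joblib
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()
model = joblib.load("polygal_mmr_model.joblib")


class ProductSpec(BaseModel):
    # Preprocessed specification features, in the order the model expects.
    features: list[float]


@app.post("/predict")
def predict(spec: ProductSpec):
    # The model returns one row of predicted ingredient amounts and settings.
    prediction = model.predict([spec.features])[0]
    return {"prediction": prediction.tolist()}
```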
Talking of Polygal, the most feasible and cost-effective solution would be to embed the model within the application or use a microservices architecture. This approach allows for efficient utilization of resources while ensuring seamless integration with the app.
Once deployed, the model becomes a valuable tool for business operations, just like having an expert at hand. While it can’t replace human expertise entirely, it can handle routine tasks efficiently and accurately. Over time, the model can be refined and enhanced based on feedback and new data by iterating on the training and validation steps and experimenting with different algorithms and parameters.
Given that the approach to building regression models differs from the one related to LLMs, the toolset is also different. Let’s look at the most popular tools used to develop multiple regression algorithms.
Here’s a brief overview of some key libraries we commonly use in the ML pipeline:
Of course, we’re not limited to just these platforms. For instance, when working with Gradient Boosting Trees models, we often turn to Advanced Data Analysis by ChatGPT. This tool provides robust features for writing and testing code, along with an intuitive interface allowing developers to interact with the model effectively.
Using AI to predict the exact amounts of ingredients for Polygal’s chemical recipes is a viable approach, but it’s important to choose the right type of AI model for the task. For example, LLMs are powerful for processing and generating human-like text, but they may not be the best fit for the precise, quantitative predictions required in chemical manufacturing.
Instead, predictive AI models tailored to regression tasks, such as Multivariate Multiple Regression, can provide more accurate forecasts. These models are designed to understand the relationship between multiple independent variables and multiple dependent variables. For example, they could predict "Sales" and "Customer Satisfaction" based on several independent variables like "Marketing Spend" and "Product Quality." In our case, MMR can predict exact ingredient quantities and production settings based on the manufacturing specifications of chemical products.
If you’re looking to develop a robust multiple regression model or integrate AI into your app, we have a proven approach to get your project up and running. Reach out to our team to discuss the details.