ML Topics: Regression


After talking about classification and clustering, let’s talk about another popular machine learning task: regression analysis.

What is regression?

Regression is a supervised learning approach that models a target value based on independent predictors. In other words, regression analysis estimates the relationship between a dependent variable (the outcome) and one or more independent variables (the predictors). It is mostly used for forecasting and for investigating cause-and-effect relationships between variables. Regression techniques predict continuous responses, for example changes in temperature or fluctuations in electricity demand. If the response in your data is a real number, such as a temperature, regression techniques are a good choice.

A well-known example of regression is the prediction of housing prices. Given several known features, such as floor plan, unit size, distance to specific landmarks, and amenities, an algorithm can predict the price at which your house could sell.

Common regression techniques

Linear regression

Linear regression is one of the most basic forms of regression. In its simplest form, it predicts a response using a single feature. Consider a dataset in which every feature value x has a response value y (left half of Figure 1). The regression task is to find the line that best fits the scatter plot so that we can predict the response for any new feature value. After running a linear regression algorithm, we obtain a line that fits the scattered data points (the blue line in the right half of Figure 1), and this line is called the regression line.

Figure 1: Linear Regression

In short, linear regression is a statistical modeling technique used to describe a continuous response variable as a linear function of one or more predictor variables. Because linear regression models are simple to interpret and easy to train, they are often the first model to be fitted to a new dataset.
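To make the idea of a regression line concrete, here is a minimal sketch of simple (single-feature) linear regression using the closed-form least-squares solution. The function name `fit_line` and the data points are made up for illustration; they are not taken from Figure 1.

```python
def fit_line(xs, ys):
    """Return (slope, intercept) of the least-squares regression line."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope = covariance(x, y) / variance(x); the 1/n factors cancel out.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    # The least-squares line always passes through (mean_x, mean_y).
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical data, roughly following y = 2x.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]

slope, intercept = fit_line(xs, ys)   # roughly 1.99 and 0.05
prediction = slope * 6.0 + intercept  # predicted response for a new x = 6
print(slope, intercept, prediction)
```

With more than one predictor the same idea applies, but the line becomes a plane or hyperplane and the solution is usually computed with matrix routines.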

Nonlinear regression

In contrast to linear regression, nonlinear regression is a statistical modeling technique that helps describe nonlinear relationships in experimental data. Nonlinear regression models are generally assumed to be parametric, where the model is described as a nonlinear equation. When the data shows strong nonlinear trends and cannot be easily transformed into a linear space, nonlinear regression is a favorable choice.

“Nonlinear” refers to a fit function that is a nonlinear function of the parameters. For example, if the fitting parameters are \(C_{0}\), \(C_{1}\), and \(C_{2}\): the equation \( y=C_{0}+C_{1}x+C_{2}x^{2} \) is a linear function of the fitting parameters, whereas \(y=\frac{C_{0}x^{C_{1}}}{x+C_{2}}\) is a nonlinear function of the fitting parameters.
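When the model is nonlinear in its parameters, there is generally no closed-form solution, so the parameters are found iteratively. The sketch below fits a hypothetical model \(y=C_{0}e^{C_{1}x}\) (chosen for brevity, not one of the equations above) with a small damped Gauss-Newton loop in the style of Levenberg-Marquardt. The starting values, the damping schedule, and the noise-free data are all assumptions made for illustration.

```python
import math


def sse(xs, ys, c0, c1):
    """Sum of squared errors of the model y = c0 * exp(c1 * x)."""
    return sum((y - c0 * math.exp(c1 * x)) ** 2 for x, y in zip(xs, ys))


def fit_exponential(xs, ys, c0=1.0, c1=0.3, max_iter=200):
    """Damped Gauss-Newton fit of y = c0 * exp(c1 * x)."""
    damping = 1e-3
    current = sse(xs, ys, c0, c1)
    for _ in range(max_iter):
        # Accumulate J^T J (a00, a01, a11) and J^T r (g0, g1).
        a00 = a01 = a11 = g0 = g1 = 0.0
        for x, y in zip(xs, ys):
            e = math.exp(c1 * x)
            r = y - c0 * e       # residual
            j0 = e               # d(model)/d(c0)
            j1 = c0 * x * e      # d(model)/d(c1)
            a00 += j0 * j0
            a01 += j0 * j1
            a11 += j1 * j1
            g0 += j0 * r
            g1 += j1 * r
        # Solve the damped 2x2 system (J^T J + damping * I) step = J^T r.
        d00, d11 = a00 + damping, a11 + damping
        det = d00 * d11 - a01 * a01
        step0 = (d11 * g0 - a01 * g1) / det
        step1 = (d00 * g1 - a01 * g0) / det
        if abs(step0) + abs(step1) < 1e-12:
            break
        trial = sse(xs, ys, c0 + step0, c1 + step1)
        if trial < current:
            # Accept the step and trust the local model a little more.
            c0, c1, current = c0 + step0, c1 + step1, trial
            damping /= 10.0
        else:
            # Reject the step and take a more cautious one next time.
            damping *= 10.0
    return c0, c1


# Noise-free data generated from c0 = 2.0, c1 = 0.5 (hypothetical values).
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [2.0 * math.exp(0.5 * x) for x in xs]

c0, c1 = fit_exponential(xs, ys)
print(c0, c1)
```

Unlike the linear case, the result can depend on the starting values, which is one reason nonlinear fits need more care.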

Gaussian Process Regression (GPR) Model

Gaussian process regression (GPR) models are popular for interpolating spatial data, such as hydrogeological data on the distribution of groundwater. GPR models, also referred to as Kriging, are nonparametric models used to predict the value of a continuous response variable. Kriging is a method of interpolation in which the interpolated values are modeled by a Gaussian process governed by prior covariances.

Figure 2: GPR Model

Figure 2 is a simple example of a GPR model generated with scikit-learn in Python (Gaussian Processes regression: basic introductory example). GPR models are widely used in spatial analysis for interpolation in the presence of uncertainty. They are also commonly used as surrogate models to facilitate the optimization of complex designs such as automotive engines.
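To show what "interpolation in the presence of uncertainty" means, here is a from-scratch sketch of the GPR posterior mean and variance at a single query point. It is not the scikit-learn implementation behind Figure 2. It assumes a zero-mean prior, an RBF kernel with length scale 1, and a tiny noise term for numerical stability, and all function names are hypothetical.

```python
import math


def rbf(a, b, length_scale=1.0):
    """RBF (squared-exponential) covariance between two 1D inputs."""
    return math.exp(-0.5 * ((a - b) / length_scale) ** 2)


def solve(matrix, rhs):
    """Solve matrix @ x = rhs by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    aug = [row[:] + [rhs[i]] for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(aug[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (aug[i][n] - s) / aug[i][i]
    return x


def gp_predict(train_x, train_y, query, noise=1e-8):
    """Posterior mean and variance of a zero-mean GP at one query point."""
    n = len(train_x)
    # Covariance between training points, with a small noise term on the diagonal.
    K = [[rbf(train_x[i], train_x[j]) + (noise if i == j else 0.0)
          for j in range(n)] for i in range(n)]
    # Covariance between the query point and each training point.
    k_star = [rbf(query, xi) for xi in train_x]
    alpha = solve(K, train_y)
    mean = sum(k * a for k, a in zip(k_star, alpha))
    v = solve(K, k_star)
    var = rbf(query, query) - sum(k * w for k, w in zip(k_star, v))
    return mean, max(var, 0.0)


# Hypothetical observations of sin(x) at five locations.
train_x = [0.0, 1.0, 2.0, 3.0, 4.0]
train_y = [math.sin(x) for x in train_x]

mean_at_data, var_at_data = gp_predict(train_x, train_y, 2.0)  # at an observed point
mean_far, var_far = gp_predict(train_x, train_y, 100.0)        # far from all data
print(mean_at_data, var_at_data, mean_far, var_far)
```

At an observed location the predicted mean reproduces the observation and the variance is close to zero, while far away from the data the prediction falls back to the prior (mean 0, variance 1). This growing uncertainty away from the data is what the shaded band in plots like Figure 2 typically represents.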

SVM Regression (SVR)

If your data has a large number of predictor variables, that is, if you are facing high-dimensional data, SVM regression is a common solution. SVMs can be used not only for classification but also for regression. SVM regression algorithms work like SVM classification algorithms, with several modifications that enable them to predict a continuous response. Instead of finding a hyperplane that separates data, SVM regression algorithms find a model that deviates from the measured data by no more than a small amount, with parameter values that are as small as possible. Figure 3 is an example of 1D SVM regression using linear, polynomial, and RBF kernels.

Figure 3: SVM Regression
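The two goals described above (stay within a small deviation of the data, keep the parameters small) are commonly written as an epsilon-insensitive loss plus a penalty on the weights. The sketch below evaluates that objective for a 1D linear model; it only scores a candidate model and does not train one. The names `epsilon`, `C`, and the data are assumptions chosen for illustration.

```python
def epsilon_insensitive_loss(y_true, y_pred, epsilon=0.1):
    """Zero inside the epsilon tube, linear in the deviation outside it."""
    return max(0.0, abs(y_true - y_pred) - epsilon)


def svr_objective(w, b, xs, ys, epsilon=0.1, C=1.0):
    """Objective of a 1D linear SVR: small weight plus penalized tube violations."""
    penalty = sum(epsilon_insensitive_loss(y, w * x + b, epsilon)
                  for x, y in zip(xs, ys))
    return 0.5 * w * w + C * penalty


# Hypothetical measurements lying close to the line y = x.
xs = [0.0, 1.0, 2.0, 3.0]
ys = [0.05, 1.0, 2.3, 2.95]

# Only the third point (deviation 0.3) falls outside the 0.1 tube.
losses = [epsilon_insensitive_loss(y, 1.0 * x + 0.0) for x, y in zip(xs, ys)]
objective = svr_objective(1.0, 0.0, xs, ys)  # about 0.5 + 0.2 = 0.7
print(losses, objective)
```

Points inside the tube contribute nothing to the loss, so the fitted model ends up being determined by the points on or outside the tube, much like the support vectors in SVM classification.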

Regression Tree

Decision trees can also be used to solve regression problems when the target variable is continuous. The main difference between classification tree analysis and regression tree analysis is the nature of the predicted outcome. A classification tree predicts the (discrete) class to which the data belongs, while a regression tree predicts a real number (e.g. the price of a house, or a patient’s length of stay in a hospital). The regression tree can therefore be considered a variant of the decision tree that is designed to approximate real-valued functions instead of performing classification. Figure 4 is an example of 1D regression with a decision tree. The decision tree is used to fit a sine curve with additional noisy observations. As a result, it learns a piecewise-constant approximation of the sine curve.

Figure 4: Regression Tree
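A regression tree can be sketched in a few lines for 1D data: at each node, pick the split that minimizes the squared error of the two halves, and predict the mean of the targets in each leaf. This is a simplified illustration and not the code behind Figure 4. The function names, `max_depth`, `min_size`, and the noise-free sine data are assumptions.

```python
import math


def mean(values):
    return sum(values) / len(values)


def sse(values):
    """Sum of squared deviations from the mean."""
    m = mean(values)
    return sum((v - m) ** 2 for v in values)


def build_tree(points, depth=0, max_depth=3, min_size=2):
    """Build a tree from (x, y) pairs sorted by x; a leaf is a plain float."""
    ys = [y for _, y in points]
    if depth >= max_depth or len(points) < 2 * min_size:
        return mean(ys)
    best = None
    for i in range(min_size, len(points) - min_size + 1):
        if points[i - 1][0] == points[i][0]:
            continue  # cannot split between identical x values
        cost = sse(ys[:i]) + sse(ys[i:])
        if best is None or cost < best[0]:
            best = (cost, i)
    if best is None:
        return mean(ys)
    i = best[1]
    threshold = (points[i - 1][0] + points[i][0]) / 2
    return {
        "threshold": threshold,
        "left": build_tree(points[:i], depth + 1, max_depth, min_size),
        "right": build_tree(points[i:], depth + 1, max_depth, min_size),
    }


def predict(tree, x):
    """Walk down the tree until a leaf (a float) is reached."""
    while isinstance(tree, dict):
        tree = tree["left"] if x <= tree["threshold"] else tree["right"]
    return tree


# Hypothetical training data: a sine curve sampled at 25 points.
xs = [i * 0.25 for i in range(25)]
points = [(x, math.sin(x)) for x in xs]

tree = build_tree(points, max_depth=3)
predictions = [predict(tree, x) for x in xs]
print(sorted(set(predictions)))  # at most 2**3 = 8 distinct leaf values
```

Because every leaf returns a single value, the fitted curve is a staircase. Increasing `max_depth` gives more, narrower steps, which follows the data more closely but also makes it easier to fit noise.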

In a nutshell …

Regression, or regression analysis, is a set of statistical methods used to estimate the relationships between a dependent variable and one or more independent variables. It is widely used for prediction and forecasting, which makes regression a predominant empirical tool in economics and finance. For example, it is often used to predict consumption spending, fixed investment spending, and inventory investment, to forecast revenues and expenses, and to analyze the systematic risk of an investment. Regression is also widely applied to estimate a trend line that represents the variation of some quantitative data over time (such as GDP or oil prices).

Together with classification and clustering, which we covered in our previous blog articles, regression is one of the three most popular and well-known machine learning categories that every ML enthusiast should know, and they are a good place to start for people who want to learn ML. We hope you like our articles, and we will have more ML topics in the future!



Editor: Chieh-Feng Cheng
Ph.D. in ECE, Georgia Tech
Technical Writer, inwinSTACK

