Data Preprocessing

Preprocessing is normally carried out before a time series is used in actual models or algorithms. It is meant to enhance certain characteristics of the process being studied or to remove specific problems from the data. One example is detrending, which removes a linear trend from a series of observations so that the subsequent analysis is independent of the trend. We might also take the logarithm of each sample because we suppose that the process is exponential, or we might try to reduce noise or fill in missing data.

Spline Smoothing

One way to complete missing data is to interpolate between known values and reconstruct the missing points. Spline Smoothing does just that: it uses a smooth spline to rebuild the data that is not in your dataset.
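
As a rough illustration of the idea, the sketch below fills gaps with a natural cubic spline that passes through the known samples. This is an interpolating spline chosen for brevity, not necessarily the smoothing spline used by the software. The function name `fill_missing_with_spline` and the use of `None` to mark missing samples are assumptions made for this example.

```python
def fill_missing_with_spline(series):
    """Fill None entries lying between known samples (hypothetical helper).

    Uses a natural cubic spline through the known points, with the sample
    index as the x coordinate. Gaps before the first or after the last
    known sample are left as None.
    """
    xs = [i for i, v in enumerate(series) if v is not None]
    ys = [series[i] for i in xs]
    n = len(xs)
    if n < 3:
        raise ValueError("need at least three known samples")
    h = [xs[i + 1] - xs[i] for i in range(n - 1)]

    # Tridiagonal system for the interior second derivatives;
    # the natural boundary condition fixes the two end values at zero.
    diag = [2.0 * (h[i - 1] + h[i]) for i in range(1, n - 1)]
    rhs = [6.0 * ((ys[i + 1] - ys[i]) / h[i] - (ys[i] - ys[i - 1]) / h[i - 1])
           for i in range(1, n - 1)]

    # Forward elimination (Thomas algorithm).
    for k in range(1, len(diag)):
        w = h[k] / diag[k - 1]
        diag[k] -= w * h[k]
        rhs[k] -= w * rhs[k - 1]

    # Back substitution.
    inner = [0.0] * len(diag)
    inner[-1] = rhs[-1] / diag[-1]
    for k in range(len(diag) - 2, -1, -1):
        inner[k] = (rhs[k] - h[k + 1] * inner[k + 1]) / diag[k]
    m = [0.0] + inner + [0.0]

    # Evaluate the spline at every missing index between two known points.
    filled = list(series)
    for i in range(n - 1):
        x0, x1 = xs[i], xs[i + 1]
        for x in range(x0 + 1, x1):
            a, b = x1 - x, x - x0
            filled[x] = ((m[i] * a ** 3 + m[i + 1] * b ** 3) / (6.0 * h[i])
                         + (ys[i] / h[i] - m[i] * h[i] / 6.0) * a
                         + (ys[i + 1] / h[i] - m[i + 1] * h[i] / 6.0) * b)
    return filled


rebuilt = fill_missing_with_spline([0.0, 1.0, None, 3.0, 4.0])
```

Known samples are returned unchanged; only the gaps are filled.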

Linear Detrend

Linear detrending is one of the most used and abused methods of preprocessing. It is very useful when the data we are analyzing is composed of an underlying trend and some other superimposed additive signal. Of course, we may also see a trend where there is none, in which case the algorithm simply removes the apparent trend from the data.
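
A minimal sketch of linear detrending, assuming the trend is estimated by an ordinary least-squares line fitted against the sample index (the fitting method is an assumption here, and `linear_detrend` is a hypothetical name):

```python
def linear_detrend(samples):
    """Subtract the least-squares straight line from the samples."""
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples")
    t_mean = (n - 1) / 2.0
    y_mean = sum(samples) / n
    sxx = sum((t - t_mean) ** 2 for t in range(n))
    sxy = sum((t - t_mean) * (y - y_mean) for t, y in enumerate(samples))
    slope = sxy / sxx
    intercept = y_mean - slope * t_mean
    # What remains is the signal superimposed on the fitted trend.
    return [y - (intercept + slope * t) for t, y in enumerate(samples)]


residual = linear_detrend([1.0, 3.0, 5.0, 7.0])  # a pure trend leaves (near) zeros
```

Note that the fit always returns some line, even for data with no real trend, which is how the method ends up removing an apparent trend.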

Rescale

Rescaling is often useful to “move” data into a better numeric range. For example, a signal whose values were originally spread between -0.01 and +0.01 can be mapped to a new scale of 1.0 to 10.0. This allows us to apply further transformations, such as the square root or the logarithm, that are not defined for negative values.
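
The mapping in the example above can be sketched as a linear (min-max) rescaling. The linear form, the name `rescale` and its default arguments are assumptions for illustration:

```python
def rescale(samples, new_min=1.0, new_max=10.0):
    """Linearly map samples so their minimum and maximum hit the new range."""
    lo, hi = min(samples), max(samples)
    if hi == lo:
        raise ValueError("cannot rescale a constant signal")
    scale = (new_max - new_min) / (hi - lo)
    return [new_min + (x - lo) * scale for x in samples]


scaled = rescale([-0.01, 0.0, 0.01])  # roughly [1.0, 5.5, 10.0]
```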

Normalize, z Score

Normalize is a classical preprocessing option. It removes the mean from the samples and changes their variance to one. In effect, the mean $\mu$ is subtracted from each sample and the result is divided by the standard deviation $\sigma$ (the square root of the variance) of the signal. The result is also called the z score.

$Transformed_i = { { Sample_i - \mu } \over \sigma }$
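
A short sketch of the formula above. It assumes the population standard deviation (division by N); the text does not say whether the population or the sample estimate is intended, and `z_score` is a hypothetical name.

```python
import statistics


def z_score(samples):
    """Subtract the mean and divide by the (population) standard deviation."""
    mu = statistics.fmean(samples)
    sigma = statistics.pstdev(samples)  # assumption: population std. deviation
    if sigma == 0:
        raise ValueError("constant signal has zero variance")
    return [(x - mu) / sigma for x in samples]


z = z_score([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])  # mean 5, std. deviation 2
```

The transformed signal has zero mean and unit variance regardless of the original units.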

Multiply

Multiply changes the signal in a very simple way: it multiplies each sample by a constant value $k$. This is useful when the values in the signal are too small or too big, or when an operation requires multiplying or dividing by a constant such as π.

$Transformed_i = Sample_i \cdot k$

Difference

Difference calculates the first difference of a signal, which is very roughly equivalent to taking its first derivative. The signal is discrete rather than continuous, so the two are not the same, but the first difference is a computation used by several algorithms.

$Transformed_i = Sample_i - Sample_{i-1}$
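
A minimal sketch of the first difference. It assumes the first sample, which has no predecessor, is simply dropped, so the output is one element shorter than the input; how the software treats that first element is not stated here.

```python
def difference(samples):
    """Return Sample_i - Sample_{i-1} for every sample that has a predecessor."""
    return [cur - prev for prev, cur in zip(samples, samples[1:])]


steps = difference([1, 4, 9, 16])  # [3, 5, 7]
```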

Log Abs Difference

The transformation computes the logarithm of the absolute value of the first difference of the signal.

$Transformed_i = \log( | Sample_i - Sample_{i-1} |)$

Log and Log of Log

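Log and Log of Log are very useful when the signal exhibits an exponential trend. Taking the logarithm turns an exponential trend into a linear one, which can remove the need to build complex or nonlinear models. Log of Log applies the same transformation twice.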

$Transformed_i = \log( Sample_i )$
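
To see why the logarithm helps, the sketch below applies it to a synthetic exponential series and checks that the result grows by a constant step, as a straight line does. The series and the name `log_transform` are made up for this example, and the samples are assumed to be strictly positive.

```python
import math


def log_transform(samples):
    """Natural logarithm of each sample; samples must be strictly positive."""
    return [math.log(x) for x in samples]


# Synthetic exponential growth: each sample is 1.5 times the previous one.
series = [2.0 * 1.5 ** t for t in range(5)]
logged = log_transform(series)

# After the transformation the step between consecutive samples is constant
# (up to rounding) and equal to log(1.5): the trend has become linear.
steps = [b - a for a, b in zip(logged, logged[1:])]
```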

Exp and Exp of Exp

Exp and Exp of Exp are the functional counterparts of Log and Log of Log. They do not exactly invert the previous transformations, because of the rescaling that takes place when the signal contains negative numbers.

$Transformed_i = e ^ {Sample_i}$

Root

Root takes the n-th root of each sample of the signal.

$Transformed_i = Sample_i ^ { 1 \over n}$

Power

Power is the counterpart of Root: it raises each sample of the signal to the n-th power (Root is simply Power with 1/n as the exponent).

$Transformed_i = Sample_i ^ n$

Tanh

Tanh computes the hyperbolic tangent of each sample of the signal.

$Transformed_i = \tanh( Sample_i )$

Logistic

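Computes the logistic transformation of the signal, that is, the logarithm of the odds (logit) of each sample. It is defined only for samples strictly between 0 and 1. The transformation is computed using the following formula: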

$Transformed_i = \log { Sample_i \over {1.0 - Sample_i} }$
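
A small sketch of the formula above. The explicit range check and the helper names are assumptions for illustration; the second function is the standard inverse of the log-odds and is included only to show the round trip.

```python
import math


def logit(samples):
    """log(x / (1 - x)) for each sample; defined only for 0 < x < 1."""
    for x in samples:
        if not 0.0 < x < 1.0:
            raise ValueError(f"sample {x} is outside the open interval (0, 1)")
    return [math.log(x / (1.0 - x)) for x in samples]


def inverse_logit(values):
    """Map transformed values back into the open interval (0, 1)."""
    return [1.0 / (1.0 + math.exp(-v)) for v in values]


transformed = logit([0.1, 0.5, 0.9])       # 0.5 maps to 0.0
restored = inverse_logit(transformed)      # approximately [0.1, 0.5, 0.9]
```

A signal that is not already confined to that interval would need to be rescaled first.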

Absolute Value

Absolute value replaces each negative sample by the corresponding positive value.

$Transformed_i = | Sample_i |$

Box-Cox Transform

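Computes the Box-Cox transformation of the signal for a given parameter $\lambda \neq 0$ (in the limit $\lambda \to 0$ the transform becomes the logarithm of the sample). The transform is computed using the following formula: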

$Transformed_i = { { Sample_i ^ \lambda - 1.0} \over \lambda}$
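
A minimal sketch of the Box-Cox formula, assuming strictly positive samples and using the logarithm for the limiting case $\lambda = 0$, as in the usual definition of the transform. The name `box_cox` and the argument `lam` are hypothetical; how the software chooses $\lambda$ is not covered here.

```python
import math


def box_cox(samples, lam):
    """Box-Cox transform of strictly positive samples for a given lambda."""
    if any(x <= 0 for x in samples):
        raise ValueError("Box-Cox requires strictly positive samples")
    if lam == 0:
        # Limiting case of (x**lam - 1) / lam as lam approaches zero.
        return [math.log(x) for x in samples]
    return [(x ** lam - 1.0) / lam for x in samples]


half = box_cox([1.0, 4.0], 0.5)   # square-root-like: [0.0, 2.0]
unit = box_cox([1.0, 4.0], 1.0)   # lambda = 1 only shifts the data: [0.0, 3.0]
```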

Log of Ratio

This transformation is commonly used in finance: it is the logarithm of the ratio between consecutive elements of the signal. For small changes this is roughly equivalent to the percentage change from sample to sample, expressed as a fraction.

$Transformed_i = \log { Sample_i \over Sample_{i-1} }$
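
The sketch below compares the log of ratio with the plain fractional change on a made-up price series, assuming strictly positive samples. The names `log_ratio` and `fractional_change` and the prices are invented for this example.

```python
import math


def log_ratio(samples):
    """log(Sample_i / Sample_{i-1}) for every sample that has a predecessor."""
    return [math.log(cur / prev) for prev, cur in zip(samples, samples[1:])]


def fractional_change(samples):
    """(Sample_i - Sample_{i-1}) / Sample_{i-1}, for comparison."""
    return [(cur - prev) / prev for prev, cur in zip(samples, samples[1:])]


prices = [100.0, 101.0, 99.5, 100.2]   # hypothetical prices
log_returns = log_ratio(prices)
pct_returns = fractional_change(prices)

# For small moves the two measures nearly coincide. Unlike fractional changes,
# log ratios add up: their sum equals log(last / first).
total = sum(log_returns)
```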
