Time Series Analysis and Mining Sequence Data


Mining Sequence Data.

A sequence is an ordered, successive flow of related data items or events. Some important kinds of sequence data are 1) biological sequence data, 2) symbolic sequence data, and 3) time series data.

Biological sequences: This includes DNA and protein sequences. They are important in the description of DNA, RNA and proteins, and are among the basic entities studied in molecular and computational biology. These sequences are long and can contain important and hidden semantic information.

Symbolic Sequences: These consist of an ordered set of elements or events, recorded with or without a concrete notion of time. Many applications involve symbolic sequence data, such as customer shopping sequences, web click streams, program execution sequences, biological sequences, and sequences of events in science and engineering and in natural and social developments.

Time Series
A time series is a series of values of a quantity obtained at successive times, often with equal intervals between them. According to the Merriam-Webster dictionary, a time series is “a set of data collected sequentially usually at fixed intervals of time”.


Figure-1:
Univariate times series with trend


Figure-2:
Random component data plus trend

Time Series (TS) Analysis
Time series analysis is a statistical technique to analyse and model the underlying structure and patterns of data points taken over time, and also to forecast future values. The use cases of time series analysis are growing rapidly due to the big data generated in finance, health, agriculture, the Internet of Things, cyber security, and so on.

Objectives of TS Analysis
·         Description: Descriptive analysis determines trend, seasonality/cyclicity, outliers, and sudden changes or breaks.
·         Explanation: Using one TS to explain another, helping to understand how similar time series behave.
·         Prediction: Same as forecasting.
·         Control: To improve control over a physical process by monitoring it and alerting when conditions exceed an a priori determined threshold.

The major components or patterns that are analysed through time series are:

·    Trend - Indicates whether observed data in the series increase or decrease over a longer period, without season-related and irregular effects, and is a reflection of the underlying level. It is the result of influences such as population growth, price inflation, and general economic changes.

·      Seasonality - Once the trend component is estimated, it can be removed from the original data, leaving behind the combined seasonal, cyclic, and irregular components. Seasonal fluctuations in the pattern are related to the calendar, for example retail sales that fluctuate over the year due to months and holidays. The seasonal component consists of effects that are reasonably stable with respect to timing, direction, and magnitude. It arises from systematic, calendar-related influences such as: Natural Conditions - weather fluctuations that are representative of the seasons (uncharacteristic weather patterns, such as snow in summer, would be considered irregular influences); Business and Administrative Procedures - start and end of the school term; Social and Cultural Behaviour - festivals like Christmas, New Year, etc.; Trading Day Effects - the number of occurrences of each day of the week in a given month will differ from year to year; Moving Holiday Effects - Easter, Onam, Ramzan, Vishu, Chinese New Year, etc.


·      Cyclicity - This refers to non-seasonal variations that recur around the trend due to circumstances or physical reasons. Both cyclical and seasonal components have 'peak-and-trough' patterns; the difference is in the period between successive peaks (or troughs). E.g., soft drink sales will show a spike EVERY summer and a trough EVERY winter, while a cyclic pattern exists when data exhibit rises and falls that are not of a fixed period.


Figure-3:

Major components of Time Series

·      Irregularity - Instability due to random factors that do not repeat in the pattern.

Figure-4:

The above figure shows a time series with irregularity. The number of data points varies from year to year, and there are no regular time intervals between them.

The following figure shows the irregular component (sometimes also known as the residual) that remains after the seasonal and trend components have been estimated and removed.
Figure-5:
Time series decomposition
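
As a minimal sketch of such a decomposition (assuming the Python statsmodels library; the monthly series below is synthetic and illustrative), seasonal_decompose separates the trend, seasonal, and irregular parts:

```python
# A minimal sketch of classical decomposition with statsmodels (illustrative data).
import numpy as np
import pandas as pd
from statsmodels.tsa.seasonal import seasonal_decompose

idx = pd.date_range("2015-01-01", periods=96, freq="MS")   # 8 years of monthly data
rng = np.random.default_rng(0)
series = pd.Series(
    np.linspace(10, 30, 96)                                # trend
    + 5 * np.sin(2 * np.pi * np.arange(96) / 12)           # yearly seasonality
    + rng.normal(size=96),                                 # irregular component
    index=idx,
)

parts = seasonal_decompose(series, model="additive", period=12)
print(parts.trend.dropna().head())      # estimated trend
print(parts.seasonal.head())            # estimated seasonal component
print(parts.resid.dropna().head())      # irregular (residual) component
```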

Time Series Analysis for Data-driven Decision-Making

Time series analysis helps in analyzing the past, which comes in handy for forecasting the future. The method is extensively employed in financial and business forecasting, based on the historical pattern of data points collected over time and compared with current trends. This is a major advantage exploited by organizations for decision making and policy planning.

 

Time Series Analysis and Its Applicability

A time series is “an ordered sequence of values of a variable at equally spaced time intervals.” Time series analysis is used to understand the determining factors and structure behind the observed data and to choose a model for forecasting, thereby leading to better decision making.
The Time Series Analysis is applied for various purposes, such as:

·         Stock Market Analysis
·         Economic Forecasting
·         Inventory studies
·         Budgetary Analysis
·         Census Analysis
·         Yield Projection
·         Sales Forecasting and more.

Forecasting
Forecasting is a method or a technique for estimating future aspects of a business or an operation: it translates past data or experience into estimates of the future. Time series forecasting is the use of a model to predict future values based on previously observed values. Time series are widely used for non-stationary data, like economic, weather, stock price, and retail sales data.

Classical time series forecasting methods focus on linear relationships; nevertheless, they are sophisticated and perform well on a wide range of problems, provided the data is suitably prepared and the method is well configured.

Eleven classical time series forecasting methods are:
  1. Autoregression (AR), 
  2. Moving Average (MA), 
  3. Autoregressive Moving Average (ARMA), 
  4. Autoregressive Integrated Moving Average (ARIMA), 
  5. Seasonal Autoregressive Integrated Moving-Average (SARIMA), 
  6. Seasonal Autoregressive Integrated Moving-Average with Exogenous Regressors (SARIMAX), 
  7. Vector Autoregression (VAR), 
  8. Vector Autoregression Moving-Average (VARMA), 
  9. Vector Autoregression Moving-Average with Exogenous Regressors (VARMAX), 
  10. Single Exponential Smoothing (SES), 
  11. Holt Winter’s Exponential Smoothing (HWES)

Autoregression (AR)
AR stands for Autoregression. The autoregression (AR) method models the next step in the time sequence as a linear function of the observations at prior time steps. This means that the next observation can be predicted as a function of its time-lagged observations.

Autoregressive Model (AR)

The Autoregressive (AR) model forecasts the future, deriving the behavioural pattern from the past data. It is useful when there is a correlation between the data in a time series. The model is based on the linear regression of the data in the current time series against the previous data on the same series.

As an example, we may predict the value at time step t given the observations at the last two time steps (t-1 and t-2). As a regression model, suppose we represent this year's global temperature as x(t). Using the measurements of global temperature in the previous two years, x(t-1) and x(t-2), the autoregressive model can be expressed as:

x(t) = b0 + b1·x(t-1) + b2·x(t-2) + e(t)

where b0, b1, and b2 are regression coefficients and e(t) is a white-noise error term.
The order of an autoregression is the number of immediately preceding values in the series that are used to predict the value at the present time. In the above equation the order is 2.

This technique can be used on time series where input variables are taken as observations at previous time steps, called lag variables, and where the current value of the series is correlated with its previous time-step observations.
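
As a minimal sketch (assuming statsmodels; the series is synthetic and illustrative), an order-2 autoregression can be fitted and used to forecast as follows:

```python
# A minimal sketch of an AR(2) model with statsmodels.
import numpy as np
from statsmodels.tsa.ar_model import AutoReg

rng = np.random.default_rng(1)
data = np.cumsum(rng.normal(size=200))      # synthetic random-walk-like series

fit = AutoReg(data, lags=2).fit()           # x(t) regressed on x(t-1) and x(t-2)
print(fit.params)                           # intercept b0 and lag coefficients b1, b2
print(fit.predict(start=len(data), end=len(data) + 4))  # forecast five steps ahead
```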

Autocorrelation

Time series analysis uses techniques to derive insights from autocorrelated data, that is, the correlation of a series with itself. However, the models need to be chosen properly to get accurate results.

An autoregression model makes an assumption that the observations at previous time steps are useful to predict the value at the next time step. This relationship between variables is called correlation.

Consider the correlation between two variables. If both variables change in the same direction (e.g. go up together or down together), this is called a positive correlation. If the change in one variable is positive while the change in the other is negative, the directions are opposite; this is called a negative correlation.

In an AR model we can use statistical measures to calculate the correlation between the output variable and values at previous time steps at various different lags.

Because correlation is calculated between the variable and itself at previous time steps, it is called an autocorrelation. It is also called serial correlation because of the sequenced structure of time series data.

The correlation statistics can also help to choose which lag variables will be useful in a model and which will not.

Interestingly, if all lag variables show low or no correlation with the output variable, then it suggests that the time series problem may not be predictable. This can be very useful when getting started on a new dataset.

The stronger the correlation between the output variable and a specific lagged variable, the more weight the autoregression model can put on that variable when modelling.
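
As a minimal sketch (assuming statsmodels; the series is illustrative), the autocorrelation at a range of lags can be computed directly and used to shortlist lag variables:

```python
# A minimal sketch: autocorrelation of a series with itself at lags 0..10.
import numpy as np
from statsmodels.tsa.stattools import acf

rng = np.random.default_rng(2)
x = np.sin(np.linspace(0, 20, 200)) + rng.normal(scale=0.2, size=200)

for lag, r in enumerate(acf(x, nlags=10)):
    print(f"lag {lag:2d}: autocorrelation {r:+.3f}")   # strong lags are candidate AR inputs
```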

Moving Average (MA)

MA, or the Moving Average method, is a widely used technique for predicting future data in time series analysis. It predicts the next observation as the average of a selected set of past time periods. The method smooths out the random variations in the series and helps to indicate its trend and seasonality.
It models the next step in the sequence as a linear function of the residual errors from a mean process at prior time steps.


x(t) = μ + e(t) + θ1·e(t-1) + … + θq·e(t-q)

The above equation models the moving average of a univariate time series: the output variable depends linearly on the present and past error terms of the series. It uses past forecast errors in a regression, instead of past values of the forecast variable.

Moving average helps in reducing the “noise” in the series (e.g., price). If, in a price chart, the moving average is angled upward, it suggests a rising price, whereas if it points downward, it indicates the price is going down. If it moves only sideways, the price is likely range-bound. The method is suitable for univariate time series without trend and seasonal components.

Figure-6:
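
As a minimal sketch (assuming statsmodels, where an MA(q) model can be fitted as ARIMA(0, 0, q); the data is illustrative):

```python
# A minimal sketch of an MA(1) model, fitted via ARIMA with p = 0 and d = 0.
import numpy as np
from statsmodels.tsa.arima.model import ARIMA

rng = np.random.default_rng(3)
e = rng.normal(size=201)
x = 10 + e[1:] + 0.6 * e[:-1]               # synthetic MA(1) process around a mean of 10

fit = ARIMA(x, order=(0, 0, 1)).fit()       # (p, d, q) = (0, 0, 1)
print(fit.params)                           # estimated mean, theta_1, and noise variance
print(fit.forecast(steps=3))
```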

 Autoregressive Moving Average (ARMA)
The Autoregressive Moving Average (ARMA) method models the next step in the sequence as a linear function of the observations and the residual errors at prior time steps.

It combines both Autoregression (AR) and Moving Average (MA) models. Given a time series of data Xt, the ARMA model is a tool for understanding and predicting future values in this series. The AR part involves regressing the variable on its own lagged (i.e., past) values. The MA part involves linear combination of error terms occurring contemporaneously and at various times in the past. The model is usually referred to as the ARMA(p,q) model where p is the order of the AR part and q is the order of the MA part.

We can use the ARMA class to create an MA model by setting p = 0 (a zeroth-order AR part) and q ≠ 0, and vice versa for an AR model.
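
As a minimal sketch (noting that recent statsmodels versions expose ARMA through the ARIMA class with d = 0, since the standalone ARMA class was removed; the data is illustrative):

```python
# A minimal sketch: ARMA(2, 1) is ARIMA(2, 0, 1) in current statsmodels.
import numpy as np
from statsmodels.tsa.arima.model import ARIMA

rng = np.random.default_rng(4)
x = rng.normal(size=300).cumsum() * 0.1 + rng.normal(size=300)  # illustrative series

fit = ARIMA(x, order=(2, 0, 1)).fit()       # p = 2 AR terms, q = 1 MA term
print(fit.forecast(steps=5))
# Setting order=(0, 0, q) gives a pure MA model; order=(p, 0, 0) gives a pure AR model.
```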

What does it mean for data to be stationary?

A time series is stationary if its statistical properties remain constant over time. The following description explains this point further.
  • For a TS to be stationary, the mean of the series should not be a function of time. The red graph below is not stationary because the mean increases over time.
Figure-7:

  • Similarly, the variance of the series should not be a function of time. This property is known as homoscedasticity. Notice in the red graph the varying spread of data over time.
Figure-8:

  • Finally, the covariance of the ith term and the (i + m)th term should not be a function of time. In the following red graph, you will notice the spread becomes closer as time increases. Hence, the covariance is not constant with time for the ‘red series’.

Figure-9:

Why stationarize the data?
Why is this important? When running a linear regression, the assumption is that all of the observations are independent of each other. In a time series, however, we know that observations are time dependent. It turns out that many results that hold for independent random variables (the law of large numbers, the central limit theorem, etc.) also hold for stationary random variables.

Non-stationary time series need to be at least locally stationary to be modelled. If they are not, we won't have enough observations at each time point to be able to make reasonable estimates.


ARMA modelling works best on stationary series; on non-stationary ones, ARMA processes can become explosive (that is, they diverge to infinity).

It is possible to fit a non-stationary model to time series but that won't be an accurate ARMA model.

So by making the data stationary, we can actually apply Autoregressive and Moving average techniques to this time dependent variable.
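
As a minimal sketch (assuming statsmodels), the Augmented Dickey-Fuller test is one common way to check stationarity before applying AR/MA techniques:

```python
# A minimal sketch of an ADF stationarity check on an illustrative random walk.
import numpy as np
from statsmodels.tsa.stattools import adfuller

rng = np.random.default_rng(5)
random_walk = np.cumsum(rng.normal(size=500))        # non-stationary by construction

stat, pvalue = adfuller(random_walk)[:2]
print(f"ADF statistic: {stat:.2f}, p-value: {pvalue:.3f}")
# A large p-value means we cannot reject non-stationarity; difference and retest.
```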

Two methods for eliminating non-stationarity
Differencing – This method computes the difference between the current value and the value at a previous time step separated by a time lag. Differencing can remove both trend and seasonality.

Decomposition - This approach models the trend and seasonality explicitly and removes them from the series separately.
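
As a minimal sketch of the first method (assuming pandas; the series is illustrative), differencing is a one-liner, while the decomposition approach was sketched earlier with seasonal_decompose:

```python
# A minimal sketch of differencing with pandas.
import numpy as np
import pandas as pd

idx = pd.date_range("2018-01-01", periods=48, freq="MS")
rng = np.random.default_rng(6)
series = pd.Series(np.linspace(0, 20, 48) + rng.normal(size=48), index=idx)

first_diff = series.diff()          # lag-1 difference removes a linear trend
seasonal_diff = series.diff(12)     # lag-12 difference removes yearly seasonality
print(first_diff.dropna().head())
```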


Autoregressive Integrated Moving Average (ARIMA)

A stationary time series is one whose statistical properties, such as the mean, variance, and autocorrelation, are all constant over time. Hence, a non-stationary series is one whose statistical properties change over time.

ARIMA models are applied to time series in which the data show evidence of non-stationarity; an initial differencing step (corresponding to the "integrated" part of the model) can be applied one or more times to eliminate the non-stationarity.

"I" - stands for Integrated, where raw observation is differenced and is used to make the time series stationary. The AR (p) part of ARIMA shows that the time series is regressed on its own past data. The MA (q) part of ARIMA indicates that the forecast error is a linear combination of past respective errors. The I part of ARIMA shows that the data values are replaced with differenced values of “d order to obtain stationary data. Differencing is a method of transforming a non-stationary time series into a stationary one.

ARIMA is a class of models that captures a suite of different standard temporal structures in time series data.
The Autoregressive Integrated Moving Average (ARIMA) method models the next step in the sequence as a linear function of the differenced observations and residual errors at prior time steps.

It combines both Autoregression (AR) and Moving Average (MA) models as well as a differencing pre-processing step of the sequence called integration (I) to make the sequence stationary.

In statistics and econometrics, and in particular in time series analysis, an autoregressive integrated moving average (ARIMA) model is a generalization of an autoregressive moving average (ARMA) model. Both of these models are fitted to time series data either to better understand the data or to predict future points in the series (forecasting).

The components of ARIMA are AR, MA, and I. Each component is specified as an integer parameter indicating how it is used in the ARIMA model.

Non-seasonal ARIMA models are generally denoted ARIMA(p,d,q), where the parameters p, d, and q are non-negative integers: p is the order (number of time lags) of the autoregressive model, q is the order of the moving-average model, and d is the degree of differencing (the number of times the data have had past values subtracted). An ARIMA model can be used to develop AR or MA models.

(The method is suitable for univariate time series with trend and without seasonal components.)
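
As a minimal sketch (assuming statsmodels, on an illustrative trending series):

```python
# A minimal sketch of ARIMA(1, 1, 1): one AR lag, one difference, one MA term.
import numpy as np
from statsmodels.tsa.arima.model import ARIMA

rng = np.random.default_rng(7)
x = np.cumsum(0.5 + rng.normal(size=200))   # upward-trending, non-stationary series

fit = ARIMA(x, order=(1, 1, 1)).fit()       # the d = 1 step differences away the trend
print(fit.forecast(steps=5))
```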

Figure-10:
Actual vs ARIMA fitted plot of bajra production time series

 

Seasonal Variations

In time series data, seasonality is the presence of variations that occur at specific regular intervals less than a year, such as weekly, monthly, or quarterly. Seasonality may be caused by various factors, such as weather, vacation, and holidays and consists of periodic, repetitive, and generally regular and predictable patterns in the levels of a time series.

A repeating pattern within any fixed period  is known as seasonal variation.

Understanding the seasonal component in time series can improve the performance of modeling with machine learning.

This has two major advantages: 
  • Clearer Signal: Identifying and removing the seasonal component from the time series can result in a clearer relationship between input and output variables. 

  • More Information: Additional information about the seasonal component of the time series can provide new information to improve model performance.


Figure-11:
Seasonal variations in Electric Power


Removing Seasonality
Once seasonality is identified, it can be modelled.

The model of seasonality can be removed from the time series. This process is called Seasonal Adjustment, or Deseasonalizing.

A time series where the seasonal component has been removed is called seasonal stationary. A time series with a clear seasonal component is referred to as non-stationary.

 

Seasonal Autoregressive Integrated Moving-Average (SARIMA)

Autoregressive Integrated Moving Average, or ARIMA, is one of the most widely used forecasting methods for univariate time series data forecasting.

Although the method can handle data with a trend, it does not support time series with a seasonal component.

An extension to ARIMA that supports the direct modelling of the seasonal component of the series is called SARIMA.

For modelling with SARIMA, the method uses 7 parameters. The first 3 parameters are the same as in an ARIMA model; the last 4 define the seasonal process. It takes the seasonal autoregressive component, the seasonal difference, the seasonal moving average component, and the length of the season as additional parameters. In this sense, the ARIMA model we have already considered is just a special case of the SARIMA model, i.e. ARIMA(1,1,1) = SARIMA(1,1,1)(0,0,0,X), where X can be any whole number.

Figure-12:
Decomposition of TS with SARIMA

The Seasonal Autoregressive Integrated Moving Average (SARIMA) method models the next step in the sequence as a linear function of the differenced observations, errors, differenced seasonal observations, and seasonal errors at prior time steps.

It combines the ARIMA model with the ability to perform the same autoregression, differencing, and moving average modeling at the seasonal level.

A SARIMA model can be used to develop AR, MA, ARMA and ARIMA models.

Seasonal ARIMA models are usually denoted ARIMA(p,d,q)(P,D,Q)m, where m refers to the number of periods in each season, and the uppercase P,D,Q refer to the autoregressive, differencing, and moving average terms for the seasonal part of the ARIMA model.
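
As a minimal sketch (assuming statsmodels, whose SARIMAX class implements SARIMA when no exogenous regressors are passed; the data is illustrative):

```python
# A minimal sketch of SARIMA: ARIMA(1,1,1)(1,1,1) with a 12-period season.
import numpy as np
from statsmodels.tsa.statespace.sarimax import SARIMAX

rng = np.random.default_rng(8)
t = np.arange(144)
x = 0.1 * t + 5 * np.sin(2 * np.pi * t / 12) + rng.normal(size=144)  # trend + seasonality

fit = SARIMAX(x, order=(1, 1, 1), seasonal_order=(1, 1, 1, 12)).fit(disp=False)
print(fit.forecast(steps=12))       # forecast one full season ahead
```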

Figure-13:
Prediction with SARIMA

(The method is suitable for univariate time series with trend and/or seasonal components).

Exogenous and Endogenous Variables
In an economic model, an exogenous variable is one whose value is determined outside the model and is imposed on the model, and an exogenous change is a change in an exogenous variable.

In contrast, an endogenous variable is a variable whose value is determined by the model. An endogenous change is a change in an endogenous variable in response to an exogenous change that is imposed upon the model. In econometrics, an exogenous variable is assumed to be fixed in repeated sampling, which means it is a non-stochastic variable. An implication of this assumption is that the error term in the econometric model is independent of the exogenous variable.

In the simple supply and demand model, a change in consumer tastes is unexplained by the model and imposes an exogenous change in demand that leads to a change in the endogenous equilibrium price and the endogenous equilibrium quantity transacted. Here the exogenous variable is a parameter conveying consumer tastes. Similarly, a change in the consumer's income is exogenously given, outside the model, and appears in the model as an exogenous change in demand.

In the LM model of interest rate determination,  the supply of and demand for money determine the interest rate contingent on the level of the money supply, so the money supply is an exogenous variable and the interest rate is an endogenous variable.

In a model of firm behaviour with competitive input markets, the prices of inputs are exogenously given, and the amounts of the inputs to use are endogenous.

 

Seasonal Autoregressive Integrated Moving-Average with Exogenous Regressors (SARIMAX)

The Seasonal Autoregressive Integrated Moving-Average with Exogenous Regressors (SARIMAX) is an extension of the SARIMA model that also includes the modelling of exogenous variables.

Exogenous variables are also called covariates and can be thought of as parallel input sequences that have observations at the same time steps as the original series. The primary series may be referred to as endogenous data to contrast it from the exogenous sequence(s). An exogenous variable is a variable that is not affected by endogenous variables in the time series. For example, take a simple causal system like farming. Variables like weather, farmer skill, pests, and availability of seed are all exogenous to crop production. The observations for exogenous variables are included in the model directly at each time step and are not modelled in the same way as the primary endogenous sequence (e.g. as an AR, MA, etc. process).

Figure-14:
Decomposition of TS with SARIMAX


The SARIMAX method can also be used to model the subsumed models with exogenous variables, such as ARX, MAX, ARMAX, and ARIMAX.

(The method is suitable for univariate time series with trend and/or seasonal components and exogenous variables.)
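
As a minimal sketch (assuming statsmodels; the endogenous series, the exogenous covariate, and their relationship are all illustrative):

```python
# A minimal sketch of SARIMAX with a single exogenous regressor.
import numpy as np
from statsmodels.tsa.statespace.sarimax import SARIMAX

rng = np.random.default_rng(9)
n = 120
exog = rng.normal(size=(n, 1))                      # parallel exogenous input sequence
t = np.arange(n)
endog = 0.05 * t + 3 * np.sin(2 * np.pi * t / 12) + 2.0 * exog[:, 0] + rng.normal(size=n)

fit = SARIMAX(endog, exog=exog, order=(1, 1, 1), seasonal_order=(0, 1, 1, 12)).fit(disp=False)
# Future exogenous values must be supplied for the forecast horizon.
print(fit.forecast(steps=6, exog=rng.normal(size=(6, 1))))
```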

Vector autoregression (VAR) is a stochastic process model used to capture the linear interdependencies among multiple time series. VAR models generalize the univariate autoregressive model (AR model) by allowing for more than one evolving variable. All variables in a VAR enter the model in the same way: each variable has an equation explaining its evolution based on its own lagged values, the lagged values of the other model variables, and an error term. VARs have found broad application as the foundation of much dynamic macroeconomic modelling.

The Vector Autoregression (VAR) method models the next step in each time series using an AR model. It is the generalization of AR to multiple parallel time series, e.g., multivariate time series.

The notation for the model involves specifying the order for the AR(p) model as parameters to a VAR function, e.g., VAR(p).

The method is suitable for multivariate time series without trend and seasonal components.
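
As a minimal sketch (assuming statsmodels, on two synthetic interdependent series):

```python
# A minimal sketch of a VAR(1) over two coupled series.
import numpy as np
from statsmodels.tsa.api import VAR

rng = np.random.default_rng(10)
n = 300
y = np.zeros((n, 2))
for i in range(1, n):       # each variable depends on both variables' previous values
    y[i, 0] = 0.6 * y[i - 1, 0] + 0.2 * y[i - 1, 1] + rng.normal()
    y[i, 1] = 0.3 * y[i - 1, 0] + 0.5 * y[i - 1, 1] + rng.normal()

fit = VAR(y).fit(1)                         # order p = 1
print(fit.coefs)                            # estimated lag-coefficient matrices
print(fit.forecast(y[-1:], steps=3))        # needs the last p observations as context
```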

Vector autoregressive moving average (VARMA)
Though VAR modelling is widely used for dynamic modelling and forecasting of multivariate macroeconomic systems, it has limitations. VAR models are not closed under aggregation, marginalization, or the presence of measurement error. Secondly, economic models often imply that the observed processes have a vector autoregressive moving average (VARMA) representation with a non-trivial moving average component. Hence a more general model with a moving average component, VARMA, is used. VARMA models can forecast macroeconomic variables more accurately than VARs.

The Vector Autoregression Moving-Average (VARMA) method models the next step in each time series using an ARMA model. It is the generalization of ARMA to multiple parallel time series, e.g. multivariate time series.

The notation for the model involves specifying the order for the AR(p) and MA(q) models as parameters to a VARMA function, e.g., VARMA(p, q). A VARMA model can also be used to develop VAR or VMA models.

(The method is suitable for multivariate time series without trend and seasonal components.)
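
As a minimal sketch (noting that statsmodels exposes VARMA through its VARMAX class when no exogenous regressors are passed; the data is illustrative):

```python
# A minimal sketch of VARMA(1, 1) over two illustrative series.
import numpy as np
from statsmodels.tsa.statespace.varmax import VARMAX

rng = np.random.default_rng(11)
y = rng.normal(size=(200, 2)).cumsum(axis=0) * 0.05 + rng.normal(size=(200, 2))

fit = VARMAX(y, order=(1, 1)).fit(disp=False)   # (p, q) = (1, 1)
print(fit.forecast(steps=3))
```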


Vector Autoregression Moving-Average with Exogenous Regressors (VARMAX)

The Vector Autoregression Moving-Average with Exogenous Regressors (VARMAX) is an extension of the VARMA model that also includes the modelling of exogenous variables. It is a multivariate version of the ARMAX method.

The observations for exogenous variables are included in the model directly at each time step and are not modelled in the same way as the primary endogenous sequence (e.g. as an AR, MA, etc. process).
The VARMAX method can also be used to model the subsumed models with exogenous variables, such as VARX and VMAX.

(The method is suitable for multivariate time series without trend and seasonal components with exogenous variables.)
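
As a minimal sketch (assuming statsmodels; the two endogenous series and the single covariate are illustrative):

```python
# A minimal sketch of VARMAX: a VARMA(1, 1) over two series plus one exogenous regressor.
import numpy as np
from statsmodels.tsa.statespace.varmax import VARMAX

rng = np.random.default_rng(12)
n = 200
exog = rng.normal(size=(n, 1))
endog = rng.normal(size=(n, 2)) + 1.5 * exog        # both series driven by the covariate

fit = VARMAX(endog, exog=exog, order=(1, 1)).fit(disp=False)
print(fit.forecast(steps=2, exog=rng.normal(size=(2, 1))))   # future exog required
```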

Single Exponential Smoothing (SES)

With the moving average (MA) method, the next time-step value was modelled from a set of equally weighted observed values. The Single Exponential Smoothing method, also called Simple Exponential Smoothing (SES), instead models the next time step as an exponentially decreasing weighted linear function of observations at prior time steps: recent observations are given more weight than older observations when forecasting subsequent values.

(The method is suitable for univariate time series without trend and seasonal components.) The basic equation of exponential smoothing is s(t) = α·x(t) + (1 - α)·s(t-1), where s(t) is the smoothed value and the constant parameter 0 ≤ α ≤ 1 is called the smoothing constant. This effects an overall smoothing similar to the MA method.
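
As a minimal sketch (assuming statsmodels, with a fixed illustrative α):

```python
# A minimal sketch of Simple Exponential Smoothing with alpha fixed at 0.3.
import numpy as np
from statsmodels.tsa.holtwinters import SimpleExpSmoothing

rng = np.random.default_rng(13)
x = 20 + rng.normal(size=100)       # level-only series: no trend, no seasonality

fit = SimpleExpSmoothing(x).fit(smoothing_level=0.3, optimized=False)
print(fit.forecast(3))              # SES forecasts a flat line at the last smoothed level
```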

Figure-15:

 

Double Exponential Smoothing

Single smoothing does not perform well when the time series has a trend. This can be tackled by introducing a second equation for trend smoothing, used in conjunction with the single exponential smoothing equation and its constant α. The constant used in the second equation is denoted by 0 ≤ γ ≤ 1.
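
As a minimal sketch (assuming statsmodels' Holt class, which implements this two-equation scheme; the parameter names follow recent statsmodels versions and the constants are illustrative):

```python
# A minimal sketch of double (Holt's) exponential smoothing on a trending series.
import numpy as np
from statsmodels.tsa.holtwinters import Holt

rng = np.random.default_rng(14)
x = np.linspace(0, 10, 100) + rng.normal(scale=0.5, size=100)

fit = Holt(x).fit(smoothing_level=0.3, smoothing_trend=0.1, optimized=False)
print(fit.forecast(3))              # forecasts continue the estimated trend
```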

 

Holt Winter’s Exponential Smoothing (HWES)

When the time series contains both trend and seasonal components, a third equation is used to smooth out the seasonal component. The third equation introduces a third constant, denoted by 0 ≤ β ≤ 1. The method is called Holt Winter’s Exponential Smoothing (HWES), also known as the Triple Exponential Smoothing method. It models the next time step as an exponentially weighted linear function of observations at prior time steps, taking trend and seasonality into account.

(The method is suitable for univariate time series with trend and/or seasonal components.)
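
As a minimal sketch (assuming statsmodels, on an illustrative monthly series with additive trend and seasonality):

```python
# A minimal sketch of Holt-Winters (triple) exponential smoothing.
import numpy as np
from statsmodels.tsa.holtwinters import ExponentialSmoothing

rng = np.random.default_rng(15)
t = np.arange(120)
x = 0.2 * t + 8 * np.sin(2 * np.pi * t / 12) + rng.normal(size=120)

fit = ExponentialSmoothing(x, trend="add", seasonal="add", seasonal_periods=12).fit()
print(fit.forecast(12))             # forecast one full season ahead
```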


Figure-16:



Image Credits:
Figure-1: maplesoft.com
Figure-2: wikipedia.org
Figure-3: itfeature.com
Figure-4: fukamilab.github.io
Figure-5: alkaline-ml.com
Figure-6: java2s.com
Figure-7,8,9: images.squarespace-cdn.com
Figure-10: www.researchgate.net
Figure-11: drg.blob.core.windows.net
Figure-12: i1.wp.com/techrando.com
Figure-13: miro.medium.com
Figure-14: miro.medium.com
Figure-15: researchgate.net
Figure-16: i0.wp.com/www.real-statistics.com
