Time Series Analysis and Mining Sequence Data
Mining Sequence Data.
A
sequence is an ordered and successive flow of related data items or events.
Important types of sequence data include 1) biological sequence data, 2) symbolic sequence
data and 3) time series data.
Biological sequences: This includes DNA and protein
sequences. They are important in
the description of DNA, RNA and proteins, and are among the basic entities
studied in molecular and computational biology. These sequences are long
and can contain important and hidden semantic information.
Symbolic Sequences: consist of an
ordered set of elements or events, recorded with or without a concrete notion
of time. There are many applications involving data of symbolic sequences
such as customer shopping sequences, web click streams, program execution
sequences, biological sequences, and sequences of events in science and
engineering and in natural and social developments.
Time Series
A time series is a series of values of a quantity obtained at successive times, often with equal
intervals between them. According to the Merriam-Webster dictionary, a time series is
“a set of data collected sequentially usually at fixed intervals of time”.
Figure-1: Univariate time series with trend
Time Series (TS) Analysis
Time series analysis is a statistical technique used to analyse and model the underlying
structure and patterns of data points taken over time, and also to forecast the future.
The use cases of time series analysis are growing rapidly due to the big data generated in
finance, health, agriculture, the Internet of Things, cyber security and so on.
Objectives of TS Analysis
· Description: Descriptive analysis determines trend, seasonality/cyclicity, outliers, and sudden changes or breaks.
· Explanation: Using one TS to explain another, and to help understand how similar time series behave.
· Prediction: Same as forecasting.
· Control: To improve control over a physical process by monitoring it and raising an alert when conditions exceed an a priori determined threshold.
The major components or patterns that are analysed
through time series are:
· Trend - Indicates whether the observed data in the series increase or decrease over
a longer period, without season-related and irregular effects, and is a reflection
of the underlying level. It is the result of influences such as population growth,
price inflation and general economic changes.
· Seasonality - Once the trend component is estimated, it can be removed from the
original data, leaving behind the combined seasonal, cyclic and irregular components.
Seasonal fluctuations are related to the calendar; for example, retail sales fluctuate
over the year due to months and holidays. The seasonal component consists of effects
that are reasonably stable with respect to timing, direction and magnitude. It arises
from systematic, calendar-related influences such as:
- Natural conditions - weather fluctuations that are representative of the seasons (uncharacteristic weather patterns such as snow in summer would be considered irregular influences).
- Business and administrative procedures - start and end of the school term.
- Social and cultural behaviour - festivals like Christmas, New Year, etc.
- Trading day effects - the number of occurrences of each day of the week in a given month differs from year to year.
- Moving holiday effects - Easter, Onam, Ramzan, Vishu, Chinese New Year, etc.
· Cyclicity - This refers to non-seasonal variations occurring at somewhat regular
intervals around the trend due to circumstances or physical causes such as temperature.
Both cyclical and seasonal components have 'peak-and-trough' patterns; the difference is
in the period between successive peaks (or troughs). E.g., soft drink sales will show a
spike every summer and a trough every winter, which is seasonal; a cyclic pattern exists
when data exhibit rises and falls that are not of a fixed period.
· Irregularity - Instability due to random factors that do not repeat in a pattern.
Figure-4: A time series with irregularity. The number of data points varies year by year,
and there are no regular time intervals between them.
The following figure shows the irregular component (sometimes also known as the residual)
that remains after the seasonal and trend components have been estimated and removed.
Figure-5:
Time series decomposition
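As an illustration of such a decomposition, here is a minimal sketch using Python's statsmodels library, assuming a synthetic monthly series built from a linear trend, a 12-month seasonal cycle, and random noise (all hypothetical data):

```python
import numpy as np
import pandas as pd
from statsmodels.tsa.seasonal import seasonal_decompose

# Hypothetical monthly series: linear trend + 12-month seasonality + noise
rng = np.random.default_rng(42)
idx = pd.date_range("2015-01-01", periods=72, freq="MS")
values = (0.5 * np.arange(72)                            # trend
          + 10 * np.sin(2 * np.pi * np.arange(72) / 12)  # seasonality
          + rng.normal(0, 1, 72))                        # irregular component
series = pd.Series(values, index=idx)

# Additive decomposition into trend, seasonal and residual parts
result = seasonal_decompose(series, model="additive", period=12)
print(result.trend.dropna().head())     # estimated trend
print(result.seasonal.head())           # estimated seasonal component
print(result.resid.dropna().head())     # irregular component (residual)
```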
Time Series Analysis for Data-driven Decision-Making
Time series analysis helps in analysing the past, which comes in handy for forecasting
the future. The method is extensively employed in financial and business forecasting,
based on the historical pattern of data points collected over time compared with current
trends. This is its biggest advantage, and organizations use it for decision making and
policy planning.
Time Series Analysis and Its Applicability
A time series is “an ordered sequence of values of a variable at equally
spaced time intervals.” Time series analysis is used to understand the determining
factors and structure behind the observed data and to choose a model to forecast,
thereby leading to better decision making.
Time series analysis is applied for various purposes, such as:
· Stock Market Analysis
· Economic Forecasting
· Inventory Studies
· Budgetary Analysis
· Census Analysis
· Yield Projection
· Sales Forecasting, and more.
Forecasting
Forecasting is a method or a technique for estimating future aspects of a business or
an operation. It is a method for translating past data or experience into estimates of
the future. Time series forecasting is the use of a model to predict future values
based on previously observed values. Time series models are widely used for
non-stationary data, such as economic, weather, stock price, and retail sales data.
Classical time series forecasting methods may be focused on linear relationships;
nevertheless, they are sophisticated and perform well on a wide range of problems,
assuming that the data is suitably prepared and the method is well configured.
Eleven classical time series forecasting methods are:
- Autoregression (AR),
- Moving Average (MA),
- Autoregressive Moving Average (ARMA),
- Autoregressive Integrated Moving Average (ARIMA),
- Seasonal Autoregressive Integrated Moving-Average (SARIMA),
- Seasonal Autoregressive Integrated Moving-Average with Exogenous Regressors (SARIMAX),
- Vector Autoregression (VAR),
- Vector Autoregression Moving-Average (VARMA),
- Vector Autoregression Moving-Average with Exogenous Regressors (VARMAX),
- Single Exponential Smoothing (SES),
- Holt Winter’s Exponential Smoothing (HWES)
Autoregression (AR)
AR stands for Autoregression. The autoregression (AR) method models the next step in the
time sequence as a linear function of the observations at prior time steps. This
means that the next observation can be predicted as a function of its time-lagged
observations.
Autoregressive Model (AR)
The Autoregressive (AR) model forecasts the future by deriving the behavioural
pattern from past data. It is useful when there is a correlation between the data
points in a time series. The model is based on the linear regression of the current
data in the time series against the previous data in the same series.
As an example, we may predict the value for time step t given the observations at the
last two time steps (t-1 and t-2). As a regression model, assume we represent the
global temperature this year as x_t. Using measurements of the global temperature in
the previous two years (x_{t-1}, x_{t-2}), the autoregressive model can be expressed as:

x_t = c + φ_1 x_{t-1} + φ_2 x_{t-2} + ε_t

where c is a constant, φ_1 and φ_2 are the lag coefficients, and ε_t is white noise.
The order of an autoregression is the number of immediately preceding values in the
series that are used to predict the value at the present time. In the above equation
the order is 2.
This technique can be used on time series where input variables are taken as
observations at previous time steps, called lag variables, and where the current value
of the series is correlated with its previous time step observations.
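As a rough sketch of fitting such an autoregression in Python (using statsmodels' AutoReg class on simulated data; the coefficients 0.6 and 0.3 are arbitrary choices for illustration):

```python
import numpy as np
from statsmodels.tsa.ar_model import AutoReg

# Simulate an AR(2)-like series: each value depends on the two previous values
rng = np.random.default_rng(0)
x = [0.0, 0.0]
for _ in range(198):
    x.append(0.6 * x[-1] + 0.3 * x[-2] + rng.normal())

# Fit an order-2 autoregression (two lag variables, as in the example above)
result = AutoReg(x, lags=2).fit()
print(result.params)                              # intercept and lag coefficients
print(result.predict(start=len(x), end=len(x)))   # one-step-ahead forecast
```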
Autocorrelation
Time-series analysis uses techniques to derive insights from autocorrelated data,
i.e., the correlation of a series with itself. However, the models need to be
chosen properly to get accurate results.
An
autoregression model makes an assumption that the observations at previous time
steps are useful to predict the value at the next time step. This relationship
between variables is called correlation.
Consider the correlation between two variables. If both variables change in the same
direction (e.g. go up together or down together), this is called a positive
correlation. If the change in one variable is positive while the change in the
other is negative, the directions are opposite, and this is called a negative
correlation.
In an AR model we can use statistical measures to calculate the correlation
between the output variable and its values at previous time steps at various lags.
Because
correlation is calculated between the variable and itself at previous time
steps, it is called an autocorrelation.
It is also called serial correlation
because of the sequenced structure of time series data.
The
correlation statistics can also help to choose which lag variables will be
useful in a model and which will not.
Interestingly,
if all lag variables show low or no correlation with the output variable, then
it suggests that the time series problem may not be predictable. This can be
very useful when getting started on a new dataset.
The stronger the correlation between the output variable and a specific lagged
variable, the more weight the autoregression model can put on that variable when
modelling.
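A small sketch of inspecting autocorrelation in Python (pandas' autocorr and statsmodels' acf; the random-walk-style data is made up for illustration):

```python
import numpy as np
import pandas as pd
from statsmodels.tsa.stattools import acf

# Hypothetical autocorrelated series (cumulative sum of random steps)
rng = np.random.default_rng(1)
s = pd.Series(rng.normal(size=200).cumsum())

print(s.autocorr(lag=1))      # serial correlation with the previous time step
print(s.autocorr(lag=5))      # correlation at lag 5
print(acf(s, nlags=5))        # autocorrelations for lags 0..5 in one call
```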
Moving Average (MA)
MA, or the Moving Average method, is a widely used technique for predicting future
data in time series analysis. As a smoother, it predicts the next observation as the
average of a selected set of past time periods, which smooths out the random
variations in the series and helps to indicate its trend and seasonality.
As a forecasting model, it models the next step in the sequence as a linear function
of the residual errors from a mean process at prior time steps.
The moving average model of a univariate time series can be expressed as:

x_t = μ + ε_t + θ_1 ε_{t-1} + … + θ_q ε_{t-q}

where μ is the mean of the series, the θs are the model coefficients, and the εs are
error terms. The model defines the output variable as linearly contingent on the
present and past error terms of the time series. It uses past errors of the forecast
in a regression instead of past values of the forecast variable.
A moving average helps in reducing the “noise” in a series (e.g., price). If, in a
price chart, the moving average is angled upward, it suggests a rise in price,
whereas if it points downward, it indicates the price is going down. If it is moving
only sideways, the price is likely to be in a range. (The method is suitable for
univariate time series without trend and seasonal components.)
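A minimal sketch of a pure MA model in Python, assuming statsmodels' ARIMA class with the AR and differencing orders set to zero (the simulated MA(1) data and θ = 0.5 are illustrative only):

```python
import numpy as np
from statsmodels.tsa.arima.model import ARIMA

# Simulate an MA(1) process around a mean of 10: x_t = 10 + e_t + 0.5*e_{t-1}
rng = np.random.default_rng(2)
e = rng.normal(size=201)
data = 10 + e[1:] + 0.5 * e[:-1]

# order=(p, d, q) = (0, 0, 1): no AR part, no differencing, one MA term
result = ARIMA(data, order=(0, 0, 1)).fit()
print(result.params)             # estimated mean, MA coefficient, noise variance
print(result.forecast(steps=1))  # next-step forecast
```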
Figure-6:
Autoregressive Moving Average (ARMA)
The Autoregressive Moving Average (ARMA) method models the next step in the
sequence as a linear function of the observations and of the residual errors at
prior time steps.
It combines both the Autoregression (AR) and Moving Average (MA) models. Given a time
series of data X_t, the ARMA model is a tool for understanding and predicting future
values in this series. The AR part involves regressing the variable on its own lagged
(i.e., past) values. The MA part involves a linear combination of error terms
occurring contemporaneously and at various times in the past. The model is
usually referred to as the ARMA(p,q) model, where p is the order of the AR part and
q is the order of the MA part.
We can use an ARMA implementation to create an MA model by setting p = 0 (a
zeroth-order AR part) and q ≠ 0, and vice versa.
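As a sketch in Python: recent versions of statsmodels fold ARMA into the ARIMA class, so ARMA(p, q) is specified as ARIMA with d = 0 (the simulated data and orders below are illustrative):

```python
import numpy as np
from statsmodels.tsa.arima.model import ARIMA

# Simulated stationary series for illustration
rng = np.random.default_rng(3)
x = np.zeros(200)
for t in range(1, 200):
    x[t] = 0.7 * x[t - 1] + rng.normal()

# ARMA(2, 1) expressed as ARIMA(p=2, d=0, q=1)
arma = ARIMA(x, order=(2, 0, 1)).fit()
print(arma.params)

# Setting p = 0 gives a pure MA model; q = 0 gives a pure AR model
ma_only = ARIMA(x, order=(0, 0, 1)).fit()
ar_only = ARIMA(x, order=(2, 0, 0)).fit()
print(ma_only.params, ar_only.params)
```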
What does it
mean for data to be stationary?
A time series
is stationary if its statistical properties remain constant over time. The
following description explains this point further.
- For a TS to be stationary the mean of the series should not be a function of time. The red graph below is not stationary because the mean increases over time.
Figure-7:
- Similarly, the variance of the series should not be a function of time. This property is known as homoscedasticity. Notice in the red graph the varying spread of data over time.
Figure-8:
- Finally, the covariance of the i-th term and the (i + m)-th term should not be a function of time. In the following red graph, you will notice the spread becomes closer as time increases. Hence, the covariance is not constant with time for the ‘red series’.
Figure-9:
Why stationarize the data?
Why is this important? When running a linear regression, the assumption is that all of the observations are independent of each other. In a time series, however, we know that observations are time dependent. It turns out that many results that hold for independent random variables (the law of large numbers, the central limit theorem, etc.) also hold for stationary random variables.
Non-stationary time series need to be at least locally stationary to be modelled; if
they are not, we won't have enough observations at each time point to be able to make
reasonable estimates.
ARMA modelling works best on stationary series, as on non-stationary ones ARMA processes become explosive (that is, they go to infinity).
It is possible to fit a non-stationary model to a time series, but that won't be an accurate ARMA model.
So by making the data stationary, we can apply autoregressive and moving average techniques to this time-dependent variable.
Two methods for eliminating non-stationarity
Differencing - This method computes the difference between the current value of a TS
sample and a previous value at a given time lag. Differencing removes both trend and
seasonality.
Decomposition - This approach models trend and seasonality separately.
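A sketch of checking for stationarity and differencing in Python, assuming the augmented Dickey-Fuller test from statsmodels (the trending series is synthetic):

```python
import numpy as np
import pandas as pd
from statsmodels.tsa.stattools import adfuller

# Synthetic non-stationary series: linear trend plus noise
rng = np.random.default_rng(4)
s = pd.Series(0.3 * np.arange(200) + rng.normal(size=200))

# Augmented Dickey-Fuller test: a large p-value fails to reject non-stationarity
print("p-value before differencing:", adfuller(s)[1])

# First difference (lag-1): removes the linear trend
diffed = s.diff().dropna()
print("p-value after differencing:", adfuller(diffed)[1])
```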
Autoregressive Integrated Moving Average (ARIMA)
A stationary (time) series is one whose statistical properties such as the mean, variance and autocorrelation are all constant over time. Hence, a non-stationary series is one whose statistical properties change over time.
ARIMA models are applied to time series in which the data show evidence of non-stationarity, where an initial differencing step (corresponding to the "integrated" part of the model) can be applied one or more times to eliminate the non-stationarity.
"I" stands for Integrated: raw observations are differenced to make the time series stationary. The AR (p) part of ARIMA shows that the time series is regressed on its own past data. The MA (q) part of ARIMA indicates that the forecast error is a linear combination of past respective errors. The I part of ARIMA shows that the data values are replaced with differenced values of order "d" to obtain stationary data. Differencing is a method of transforming a non-stationary time series into a stationary one.
It is a class of model that captures a suite of different standard temporal structures in time series data.
The Autoregressive Integrated Moving Average (ARIMA) method models the next step in the sequence as a linear function of the differenced observations and residual errors at prior time steps.
It combines both Autoregression (AR) and Moving Average (MA) models as well as a differencing pre-processing step of the sequence called integration (I) to make the sequence stationary.
In statistics and econometrics, and in particular in time series analysis, an autoregressive integrated moving average (ARIMA) model is a generalization of an autoregressive moving average (ARMA) model. Both of these models are fitted to time series data either to better understand the data or to predict future points in the series (forecasting).
The components of ARIMA are AR, MA and I. Each component is defined as a parameter which is substituted as integers to indicate the usage of the ARIMA model.
Non-seasonal ARIMA predictors are generally modelled by ARIMA(p,d,q) where predictor parameters p, d, and q are non-negative integers, p is the order (number of time lags) of the autoregressive model, q is the order of the moving-average model and d is the degree of differencing (the number of times the data have had past values subtracted). An ARIMA model can be used to develop AR or MA models.
(The method is suitable for univariate time series with trend and without seasonal components.)
Figure-10:
Actual vs ARIMA fitted plot of bajra production time series
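A minimal ARIMA(p, d, q) sketch in Python using statsmodels (the trending series and the (1, 1, 1) order are illustrative choices, not tuned values):

```python
import numpy as np
from statsmodels.tsa.arima.model import ARIMA

# Synthetic series with a trend, so one round of differencing (d=1) is sensible
rng = np.random.default_rng(5)
data = 0.5 * np.arange(150) + rng.normal(size=150).cumsum()

# ARIMA(1, 1, 1): one AR lag, first differencing, one MA term
result = ARIMA(data, order=(1, 1, 1)).fit()
print(result.summary())
print(result.forecast(steps=5))   # forecast the next five steps
```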
Seasonal Variations
In time series data, seasonality is the presence of variations that occur at specific
regular intervals of less than a year, such as weekly, monthly, or quarterly.
Seasonality may be caused by various factors, such as weather, vacations, and
holidays, and consists of periodic, repetitive, and generally regular and predictable
patterns in the levels of a time series.
A repeating pattern within any fixed period is known as seasonal variation.
Understanding
the seasonal component in time series can improve the performance of modeling
with machine learning.
This
has two major advantages:
- Clearer Signal: Identifying and removing the seasonal component from the time series can result in a clearer relationship between input and output variables.
- More Information: Additional information about the seasonal component of the time series can provide new information to improve model performance.
Figure-11:
Seasonal variations in
Electric Power
Removing Seasonality
Once
seasonality is identified, it can be modelled.
The
model of seasonality can be removed from the time series. This process is
called Seasonal Adjustment, or Deseasonalizing.
A
time series where the seasonal component has been removed is called seasonal
stationary. A time series with a clear seasonal component is referred to as
non-stationary.
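One way to sketch seasonal adjustment in Python is to estimate the seasonal component with statsmodels' seasonal_decompose and subtract it (additive model assumed; the monthly data is synthetic):

```python
import numpy as np
import pandas as pd
from statsmodels.tsa.seasonal import seasonal_decompose

# Synthetic monthly series with a 12-month seasonal pattern
rng = np.random.default_rng(6)
idx = pd.date_range("2016-01-01", periods=60, freq="MS")
series = pd.Series(8 * np.sin(2 * np.pi * np.arange(60) / 12)
                   + rng.normal(0, 1, 60), index=idx)

# Estimate the seasonal component, then remove it (deseasonalize)
decomp = seasonal_decompose(series, model="additive", period=12)
deseasonalized = series - decomp.seasonal
print(deseasonalized.head())
```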
Seasonal Autoregressive Integrated Moving-Average (SARIMA)
Autoregressive Integrated Moving Average, or ARIMA, is one of the most widely used
methods for univariate time series forecasting.
Although
the method can handle data with a trend, it does not support time series with a
seasonal component.
An
extension to ARIMA that supports the direct modelling of the seasonal component
of the series is called SARIMA.
For modelling with SARIMA, the method uses 7 parameters. The first 3 parameters are
the same as in an ARIMA model; the last 4 define the seasonal process. It takes the
seasonal autoregressive component, the seasonal difference, the seasonal moving
average component, and the length of the season as additional parameters. In this
sense the ARIMA model that we have already considered is just a special case of the
SARIMA model, i.e. ARIMA(1,1,1) = SARIMA(1,1,1)(0,0,0,X) where X can be any whole number.
Figure-12:
Decomposition of TS with
SARIMA
The Seasonal Autoregressive
Integrated Moving Average (SARIMA) method
models the next step in the sequence as a linear function of the differenced
observations, errors, differenced seasonal observations, and seasonal errors at
prior time steps.
It
combines the ARIMA model with the ability to perform the same autoregression,
differencing, and moving average modeling at the seasonal level.
A
SARIMA model can be used to develop AR, MA, ARMA and ARIMA models.
Seasonal ARIMA models are usually denoted ARIMA(p,d,q)(P,D,Q)m,
where m refers to the number of periods in each season, and
the uppercase P,D,Q refer to the autoregressive,
differencing, and moving average terms for the seasonal part of the ARIMA
model.
Figure-13:
Prediction with SARIMA
(The
method is suitable for univariate time series with trend and/or seasonal
components).
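A SARIMA sketch in Python via statsmodels' SARIMAX class (which with no exogenous regressors reduces to SARIMA); the orders and the synthetic monthly data are illustrative:

```python
import numpy as np
import pandas as pd
from statsmodels.tsa.statespace.sarimax import SARIMAX

# Synthetic monthly series with trend and 12-month seasonality
rng = np.random.default_rng(7)
idx = pd.date_range("2014-01-01", periods=96, freq="MS")
y = pd.Series(0.3 * np.arange(96)
              + 6 * np.sin(2 * np.pi * np.arange(96) / 12)
              + rng.normal(0, 1, 96), index=idx)

# ARIMA(1,1,1) non-seasonal part, (1,1,1) seasonal part with a 12-period season
model = SARIMAX(y, order=(1, 1, 1), seasonal_order=(1, 1, 1, 12))
result = model.fit(disp=False)
print(result.forecast(steps=12))   # forecast one full season ahead
```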
Exogenous and Endogenous Variables
In
an economic model, an exogenous variable is one
whose value is determined outside the model and is imposed on the model, and
an exogenous change is a change in an exogenous variable.
In
contrast, an endogenous variable is a variable whose value is
determined by the model. An endogenous change is a change in
an endogenous variable in response to an exogenous change that is imposed upon
the model. In econometrics, an exogenous variable is assumed
to be fixed in repeated sampling, which means it is a non-stochastic variable.
An implication of this assumption is that the error term in the econometric model is
independent of the exogenous variable.
In
the simple supply and demand model, a change in consumer
tastes is unexplained by the model and imposes an exogenous change in demand
that leads to a change in the endogenous equilibrium price and the endogenous
equilibrium quantity transacted. Here the exogenous variable is a parameter conveying
consumer tastes. Similarly, a change in the consumer's income is exogenously
given, outside the model, and appears in the model as an exogenous change in
demand.
In
the LM model of interest rate
determination, the supply of and demand for money determine the interest rate contingent
on the level of the money supply, so the money supply is
an exogenous variable and the interest rate is an endogenous variable.
In
a model of firm behaviour with competitive input
markets, the prices of inputs are
exogenously given, and the amounts of the inputs to use are endogenous.
Seasonal Autoregressive Integrated Moving-Average with
Exogenous Regressors (SARIMAX)
The
Seasonal Autoregressive Integrated Moving-Average with Exogenous Regressors (SARIMAX) is an extension of the SARIMA model
that also includes the modelling of exogenous variables.
Exogenous
variables are also called covariates and can be thought of as parallel input
sequences that have observations at the same time steps as the original series.
The primary series may be referred to as endogenous data to contrast it from
the exogenous sequence(s). An exogenous variable
is a variable that is not affected by endogenous variables in the
time series. For example, take a simple causal system like farming. Variables
like weather, farmer skill, pests, and availability of seed are all exogenous
to crop production. The observations for exogenous variables are
included in the model directly at each time step and are not modelled in the
same way as the primary endogenous sequence (e.g. as an AR, MA, etc. process).
Figure-14:
Decomposition of TS with
SARIMAX
The
SARIMAX method can also be used to model the subsumed models with exogenous
variables, such as ARX, MAX, ARMAX, and ARIMAX.
(The
method is suitable for univariate time series with trend and/or seasonal
components and exogenous variables.)
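A sketch of SARIMAX in Python; the exogenous regressor here is a made-up parallel series, and future values of it must be supplied when forecasting:

```python
import numpy as np
import pandas as pd
from statsmodels.tsa.statespace.sarimax import SARIMAX

rng = np.random.default_rng(8)
n = 96
exog = pd.Series(rng.normal(size=n))     # hypothetical covariate series
endog = pd.Series(0.2 * np.arange(n)
                  + 4 * np.sin(2 * np.pi * np.arange(n) / 12)
                  + 1.5 * exog           # endogenous series partly driven by exog
                  + rng.normal(0, 1, n))

model = SARIMAX(endog, exog=exog, order=(1, 1, 1),
                seasonal_order=(0, 1, 1, 12))
result = model.fit(disp=False)

# Forecasting requires future values of the exogenous variable
future_exog = pd.Series(rng.normal(size=3))
print(result.forecast(steps=3, exog=future_exog))
```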
Vector Autoregression (VAR)
Vector autoregression (VAR) is a stochastic process model used to capture the linear
interdependencies among multiple time series. VAR models generalize the univariate
autoregressive model (AR model) by allowing for more than one evolving variable. All
variables in a VAR enter the model in the same way: each variable has an equation
explaining its evolution based on its own lagged values, the lagged values of the
other model variables, and an error term. VARs have found broad application as the
foundation of much dynamic macroeconomic modelling.
The
Vector Autoregression (VAR) method models the next step in each time series
using an AR model. It is the generalization of AR to multiple parallel time
series, e.g., multivariate time series.
The
notation for the model involves specifying the order for the AR(p) model as
parameters to a VAR function, e.g., VAR(p).
The
method is suitable for multivariate time series without trend and seasonal
components.
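A VAR(p) sketch in Python with statsmodels, assuming two simulated stationary series that depend on each other's lags (coefficients are arbitrary):

```python
import numpy as np
from statsmodels.tsa.api import VAR

# Simulate two interdependent stationary series
rng = np.random.default_rng(9)
x = np.zeros((200, 2))
for t in range(1, 200):
    x[t, 0] = 0.5 * x[t - 1, 0] + 0.2 * x[t - 1, 1] + rng.normal()
    x[t, 1] = 0.1 * x[t - 1, 0] + 0.4 * x[t - 1, 1] + rng.normal()

result = VAR(x).fit(2)                   # VAR(p) with p = 2 lags
print(result.params)                     # one equation per variable
print(result.forecast(x[-2:], steps=3))  # needs the last p observations
```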
Vector Autoregressive Moving Average (VARMA)
Though VAR modelling is widely used for dynamic modelling and forecasting of
multivariate macroeconomic systems, it has limitations: VARs are not closed under
aggregation, marginalization or the presence of measurement error. Secondly, economic
models often imply that the observed processes have a vector autoregressive moving
average (VARMA) representation with a non-trivial moving average component. Hence a
more general model with a moving average component, VARMA, is used. VARMA models can
forecast macroeconomic variables more accurately than VARs.
The
Vector Autoregression Moving-Average (VARMA) method models the next step in
each time series using an ARMA model. It is the generalization of ARMA to
multiple parallel time series, e.g. multivariate time series.
The
notation for the model involves specifying the order for the AR(p) and MA(q)
models as parameters to a VARMA function, e.g., VARMA(p, q). A VARMA model can
also be used to develop VAR or VMA models.
(The
method is suitable for multivariate time series without trend and seasonal
components.)
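In Python, statsmodels exposes VARMA through its VARMAX class; leaving out exog gives a plain VARMA(p, q). A sketch on simulated data:

```python
import numpy as np
from statsmodels.tsa.statespace.varmax import VARMAX

# Two simulated stationary series (illustrative only)
rng = np.random.default_rng(10)
x = np.zeros((200, 2))
for t in range(1, 200):
    x[t] = 0.5 * x[t - 1] + rng.normal(size=2)

# VARMA(1, 1): order=(p, q); without exog this is a plain VARMA model
result = VARMAX(x, order=(1, 1)).fit(disp=False)
print(result.forecast(steps=3))
```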
Vector Autoregression Moving-Average with Exogenous
Regressors (VARMAX)
The
Vector Autoregression Moving-Average with Exogenous Regressors (VARMAX) is an
extension of the VARMA model that also includes the modelling of exogenous
variables. It is a multivariate version of the ARMAX method.
The
observations for exogenous variables are included in the model directly at each
time step and are not modelled in the same way as the primary endogenous
sequence (e.g. as an AR, MA, etc. process).
The
VARMAX method can also be used to model the subsumed models with exogenous
variables, such as VARX and VMAX.
(The
method is suitable for multivariate time series without trend and seasonal
components with exogenous variables.)
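A VARMAX sketch in Python; the exogenous column is hypothetical and, as with SARIMAX, its future values must be provided at forecast time:

```python
import numpy as np
from statsmodels.tsa.statespace.varmax import VARMAX

rng = np.random.default_rng(11)
n = 200
exog = rng.normal(size=(n, 1))           # one hypothetical exogenous regressor
endog = np.zeros((n, 2))
for t in range(1, n):
    endog[t] = 0.4 * endog[t - 1] + 0.8 * exog[t] + rng.normal(size=2)

# order=(1, 0) with exog: a VARX(1) model via the VARMAX class
result = VARMAX(endog, exog=exog, order=(1, 0)).fit(disp=False)

# Supply future exogenous values when forecasting
print(result.forecast(steps=2, exog=rng.normal(size=(2, 1))))
```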
Single Exponential Smoothing (SES)
With the moving average (MA), a subsequent time step value was modelled with a set of
observed values that are equally weighted. The Single Exponential Smoothing method,
also called Simple Exponential Smoothing (SES), instead models the next time step as
an exponentially decreasing weighted linear function of observations at prior time
steps: recent observations are given more weight than older observations for
forecasting subsequent values.
(The method is suitable for univariate time series without trend and seasonal
components.) The basic equation of exponential smoothing is

s_t = αx_t + (1 − α)s_{t-1}

where the constant parameter 0 ≤ α ≤ 1 is called the smoothing constant. This produces
an overall smoothing effect similar to the MA method.
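A short SES sketch in Python with statsmodels; α = 0.3 is fixed by hand here rather than optimized, purely for illustration:

```python
import numpy as np
from statsmodels.tsa.holtwinters import SimpleExpSmoothing

# Synthetic level series without trend or seasonality
rng = np.random.default_rng(12)
data = 20 + rng.normal(0, 2, size=100)

# Fix the smoothing constant alpha = 0.3 (0 <= alpha <= 1)
fit = SimpleExpSmoothing(data).fit(smoothing_level=0.3, optimized=False)
print(fit.fittedvalues[-5:])   # smoothed values
print(fit.forecast(1))         # next-step forecast
```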
Figure-15:
Double Exponential Smoothing
Single smoothing does not perform well when the time series has a trend. This can be
tackled by the introduction of a second equation for trend smoothing, used in
conjunction with single exponential smoothing with α. The smoothing constant used in
the second (trend) equation is denoted by 0 ≤ γ ≤ 1:

s_t = αx_t + (1 − α)(s_{t-1} + b_{t-1})
b_t = γ(s_t − s_{t-1}) + (1 − γ)b_{t-1}
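A sketch of double exponential smoothing in Python, using statsmodels' ExponentialSmoothing with an additive trend and no seasonal term (Holt's linear trend method; the data is synthetic):

```python
import numpy as np
from statsmodels.tsa.holtwinters import ExponentialSmoothing

# Synthetic series with a linear trend but no seasonality
rng = np.random.default_rng(14)
data = 5 + 0.4 * np.arange(100) + rng.normal(0, 1, 100)

# Additive trend, no seasonal component: double (Holt's) exponential smoothing
fit = ExponentialSmoothing(data, trend="add").fit(smoothing_level=0.3,
                                                  smoothing_trend=0.1)
print(fit.forecast(5))   # trend-aware forecasts for the next five steps
```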
Holt Winter’s Exponential Smoothing (HWES)
When the time series contains both trend and seasonal components, a third equation is
used to smooth out the seasonal component. The third equation uses a third smoothing
constant, denoted by 0 ≤ β ≤ 1. The method is called Holt Winter's Exponential
Smoothing (HWES), also called the Triple Exponential Smoothing method. It models the
next time step as an exponentially weighted linear function of observations at prior
time steps, taking trends and seasonality into account.
(The
method is suitable for univariate time series with trend and/or seasonal
components.)
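A Holt-Winters sketch in Python with statsmodels' ExponentialSmoothing, assuming additive trend and seasonality on synthetic monthly data:

```python
import numpy as np
import pandas as pd
from statsmodels.tsa.holtwinters import ExponentialSmoothing

# Synthetic monthly series with additive trend and 12-month seasonality
rng = np.random.default_rng(13)
idx = pd.date_range("2015-01-01", periods=72, freq="MS")
y = pd.Series(0.4 * np.arange(72)
              + 5 * np.sin(2 * np.pi * np.arange(72) / 12)
              + rng.normal(0, 1, 72), index=idx)

# Triple exponential smoothing: level, trend and seasonal equations
fit = ExponentialSmoothing(y, trend="add", seasonal="add",
                           seasonal_periods=12).fit()
print(fit.forecast(12))   # forecast one full season ahead
```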
Figure-16:
References:
1. Jiawei Han, Micheline Kamber, Jian Pei, “Data Mining: Concepts and Techniques”, Morgan Kaufmann, 3rd Ed.
2. Chandler and Scott (2011), Statistical Methods for Trend Detection and Analysis in the Environmental Sciences, Wiley.
3. NIST/SEMATECH e-Handbook of Statistical Methods: https://www.itl.nist.gov/div898/handbook/pmc/section4/pmc4.htm
Image Credits:
Figure-1: maplesoft.com
Figure-2: wikipedia.org
Figure-3: itfeature.com
Figure-4: fukamilab.github.io
Figure-5: alkaline-ml.com
Figure-6: java2s.com
Figure-7,8,9: images.squarespace-cdn.com
Figure-10: www.researchgate.net
Figure-11: drg.blob.core.windows.net
Figure-12: i1.wp.com/techrando.com
Figure-13: miro.medium.com
Figure-14: miro.medium.com
Figure-15: researchgate.net
Figure-16: i0.wp.com/www.real-statistics.com