"A complete data analysis will involve the following steps: (i) Finding a good model to fit the signal based on the data. (ii) Finding a good model to fit the noise, based on the residuals from the model. (iii) Adjusting variances, test statistics, confidence intervals, and predictions, based on the model for the noise.
"A key difference between a traditional statistical problems
and a time series problem is that often, in time series, the errors are not
independent."
"A stationary time series is one that has had trend elements (the signal) removed and that has a time invariant pattern in the random noise. In other words, although there is a pattern of serial correlation in the noise, that pattern seems to mimic a fixed mathematical model so that the same model fits any arbitrary, contiguous subset of the noise." (DeWayne R Derryberry, "Basic Data Analysis for Time Series with R" 1st Ed, 2014)
"A wide variety of statistical procedures (regression, t-tests, ANOVA) require three assumptions: (i) Normal observations or errors. (ii) Independent observations (or independent errors, which is equivalent, in normal linear models to independent observations). (iii) Equal variance - when that is appropriate (for the one-sample t-test, for example, there is nothing being compared, so equal variances do not apply).
"Both real and simulated data are very important for data analysis. Simulated data is useful because it is known what process generated the data. Hence it is known what the estimated signal and noise should look like (simulated data actually has a well-defined signal and well-defined noise). In this setting, it is possible to know, in a concrete manner, how well the modeling process has worked.
"Either a logarithmic or a square-root transformation of the data would produce a new series more amenable to fit a simple trigonometric model. It is often the case that periodic time series have rounded minima and sharp-peaked maxima. In these cases, the square root or logarithmic transformation seems to work well most of the time.
"For a confidence interval, the central limit theorem plays a role in the reliability of the interval because the sample mean is often approximately normal even when the underlying data is not. A prediction interval has no such protection. The shape of the interval reflects the shape of the underlying distribution. It is more important to examine carefully the normality assumption by checking the residuals […].
"Once a model has been fitted to the data, the deviations
from the model are the residuals. If the model is appropriate, then the
residuals mimic the true errors. Examination of the residuals often provides
clues about departures from the modeling assumptions. Lack of fit - if there is
curvature in the residuals, plotted versus the fitted values, this suggests
there may be whole regions where the model overestimates the data and other
whole regions where the model underestimates the data. This would suggest that
the current model is too simple relative to some better model.
"[The normality] assumption is the least important one for the reliability of the statistical procedures under discussion. Violations of the normality assumption can be divided into two general forms: Distributions that have heavier tails than the normal and distributions that are skewed rather than symmetric. If data is skewed, the formulas we are discussing are still valid as long as the sample size is sufficiently large. Although the guidance about 'how skewed' and 'how large a sample' can be quite vague, since the greater the skew, the larger the required sample size. For the data commonly used in time series and for the sample sizes (which are generally quite large) used, skew is not a problem. On the other hand, heavy tails can be very problematic.
"When data is not normal, the reason the formulas are working
is usually the central limit theorem. For large sample sizes, the formulas are
producing parameter estimates that are approximately normal even when the data
is not itself normal. The central limit theorem does make some assumptions and
one is that the mean and variance of the population exist. Outliers in the data
are evidence that these assumptions may not be true. Persistent outliers in the
data, ones that are not errors and cannot be otherwise explained, suggest that
the usual procedures based on the central limit theorem are not applicable.
"Whenever the data is periodic, at some level, there are only as many observations as the number of complete periods. This global feature of the data suggests caution in understanding more detailed features of the data. While a curvature model might be appropriate for this data, there is too little data to know this, and some skepticism might be in order if such a model were fitted to the data." (DeWayne R Derryberry, "Basic Data Analysis for Time Series with R" 1st Ed, 2014)