But back to the subject of what happens to the data we acquire every day through our sensors [eyes, ears, nose etc.]. One important fact that is often forgotten is that before data can be turned into information, it needs to be in a form the computer can digest. This is because the data we feed into a single model will most probably come in very diverse types. One variable may be in single digits rounded to two decimal places [e.g. 1.25]. Others may take the form of zeros and ones [e.g. Yes and No answers to a question in a survey], be in millions or thousands [e.g. 5,000,000, the volume of trades on a stock], or be percentages, and so on.

Here are some processes that make data digestible for a computer.

**Filtering and De-noising:** attempts to remove unwanted elements from a data stream. For example, in data on electrical signals generated by a piece of equipment, you might want to filter out interference such as a 60 Hz mains hum. In a stock's monthly price chart that includes intra-day trades, you might want to take just the daily closing price. Besides the filters common to Fourier analysis and traditional signal processing, we now have more advanced filters, such as those from Wavelet theory, which are deemed more efficient at retaining useful information that the filters of traditional digital processing sometimes delete.
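As a rough sketch of the idea, here is a simple frequency-domain filter in Python (assuming NumPy is available); the function name `remove_hum` and the synthetic signal are illustrative, not from any particular library:

```python
import numpy as np

def remove_hum(signal, sample_rate, hum_freq=60.0, width=1.0):
    """Zero out a narrow frequency band (e.g. a 60 Hz mains hum) in the FFT,
    then transform back to the time domain."""
    spectrum = np.fft.rfft(signal)
    freqs = np.fft.rfftfreq(len(signal), d=1.0 / sample_rate)
    spectrum[np.abs(freqs - hum_freq) <= width] = 0.0
    return np.fft.irfft(spectrum, n=len(signal))

# A 5 Hz signal contaminated by 60 Hz hum, sampled at 1 kHz for 1 second.
t = np.arange(0, 1.0, 1.0 / 1000)
clean = np.sin(2 * np.pi * 5 * t)
noisy = clean + 0.5 * np.sin(2 * np.pi * 60 * t)
filtered = remove_hum(noisy, sample_rate=1000)
```

Because the hum sits on a single FFT bin here, the recovered signal is essentially identical to the clean one; real equipment noise is messier, which is where wavelet methods earn their keep.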

**Clustering:** If you can find a way to reduce the number of categories the data fits into, that is also good. Fewer categories mean less complex modeling and less work for the computer. One rule of modeling that holds true is that a model should be as simple as it can be; bells and whistles only add to the noise. So clustering is a big help. There are so many methods for clustering that it is a whole subject of intense interest by itself, from simple methods based on Euclidean mathematics [such as nearest-neighbor and k-means methods], which use the Euclidean 'distance' between data values as the criterion for clustering, to non-Euclidean methods used in Fuzzy Logic and Radial Basis Function Neural Networks.
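To make the distance-based idea concrete, here is a minimal k-means sketch in plain Python for one-dimensional data (the deterministic initialization is a simplification of my own, not the standard random-restart version):

```python
def kmeans_1d(values, k, iters=20):
    """Plain k-means on scalar data: assign each point to the nearest
    centre, then move each centre to the mean of its assigned points."""
    srt = sorted(values)
    # Deterministic start: spread the initial centres across the data range.
    centres = [srt[i * (len(srt) - 1) // (k - 1)] for i in range(k)]
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for v in values:
            nearest = min(range(k), key=lambda i: abs(v - centres[i]))
            clusters[nearest].append(v)
        centres = [sum(c) / len(c) if c else centres[i]
                   for i, c in enumerate(clusters)]
    return sorted(centres)

# Two obvious groups: values near 1 and values near 100.
data = [1.0, 2.0, 0.5, 1.5, 98.0, 99.0, 101.0, 100.5]
centres = kmeans_1d(data, k=2)  # → [1.25, 99.625]
```

After clustering, each data point can be replaced by its cluster label, collapsing eight raw values into two categories.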

**Removal of Outliers:** It is quite common to have a few extreme values in data which are not representative of it. These are called outliers and are usually removed. The usual way is to check whether the data approximately fits a Normal [Gaussian] distribution, and discard the values that are more than 2 standard deviations from the mean [values within 2 standard deviations of the mean cover roughly 95% of a Normal distribution]. A simpler and often more effective way is to view the data distribution by eye and cull the outliers. On the other hand, there are cases where values more than 2 standard deviations out are useful and should not be culled: cases where the distribution is leptokurtic. A leptokurtic distribution has a higher peak than a Normal distribution but also fatter tails, which means extreme values occur more frequently. Leptokurtosis is common in data on financial markets, meaning that episodes of volatility are more common and, due to positive feedback effects, are reinforced. Deleting outliers in financial data can therefore hurt the model. Recently, fat-tailed distributions have also been noted in consumer demand, indicating that catering to extremes in demand and income can be a profitable niche.
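The 2-standard-deviation rule sketched above can be written in a few lines of Python using only the standard library (the `z_cutoff` parameter and the sample prices are illustrative):

```python
from statistics import mean, stdev

def remove_outliers(values, z_cutoff=2.0):
    """Drop values more than z_cutoff standard deviations from the mean.
    (Within 2 sigma covers roughly 95% of a Normal distribution.)"""
    m, s = mean(values), stdev(values)
    return [v for v in values if abs(v - m) <= z_cutoff * s]

prices = [10.1, 10.3, 9.9, 10.0, 10.2, 9.8, 10.1, 55.0]  # 55.0 is an outlier
cleaned = remove_outliers(prices)  # → the 55.0 is culled
```

For leptokurtic data such as financial returns, you would skip this step or raise the cutoff, for the reasons given above.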

**Smoothing:** is a way to strip non-essential features from a data stream. It is common in time series work, where a Moving Average can be applied to smooth the data and focus your attention on the trend. There are many types of Moving Averages, from Simple to Weighted and Exponential; the formula for an Exponential MA gives more weight to recent data. All MAs lag: while they smooth, they also remove some information, so using them to extrapolate is of little use. However, some companies sell MAs developed with the latest in digital signal processing which supposedly have less lag and can be used for prediction. Smoothing can also be done by fitting a line through the data. The simplest fit is a least-squares regression, but if the data goes up and down a lot, then non-linear, quadratic or piecewise fitting methods can be applied. Non-parametric methods for fitting, such as with a Neural Network, are also available. However, we should caution that over-fitting, i.e. a fitted line that follows the ups and downs of the data too closely, is also bad. If you are using the data for prediction, over-fitting is disastrous, since it gives spurious accuracy to your forecasts. It is much better to have a looser fit that gives a more generalized output, with leeway for the vagaries, randomness and noise we find in the real world. Recently there has been research on fitting to sparse data, forcing a machine learning algorithm such as a Back-Propagation or Recurrent Neural Network to generalize, so that it predicts better when faced with a different situation.
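As a small sketch, here is an Exponential Moving Average in plain Python, using the common convention of deriving the weight from a "span" and seeding with the first observation (both are conventions, not the only choice):

```python
def ema(values, span):
    """Exponential moving average: weight alpha = 2 / (span + 1),
    so recent observations count more than older ones."""
    alpha = 2.0 / (span + 1)
    out = [values[0]]                    # seed with the first observation
    for v in values[1:]:
        out.append(alpha * v + (1 - alpha) * out[-1])
    return out

closes = [100, 102, 101, 105, 107, 106, 110]
smoothed = ema(closes, span=3)
# → [100, 101.0, 101.0, 103.0, 105.0, 105.5, 107.75]
```

Note how the smoothed series ends below the last price: that shortfall is the lag the paragraph above warns about.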

**Dimension Reduction:** Too much data can be bad. Good mathematical modeling involves picking just the variables the modeler feels are necessary to make the model a realistic representation of the real world. For example, if you are building a model to predict the Dow, data on interest rates, corporate earnings, the strength of the US$, the price of oil etc. might be considered relevant; but data on the weather, on industrial production in India or on the amount of dog food sold will contribute little. Thus modeling requires domain expertise. An expert with practical knowledge of the Asian markets [who knows which variables affect them] is more likely to construct a better model of the Hong Kong Stock Exchange than an academic modeler. Even when the number of variables is reduced, there will still be variables which are strongly correlated with each other and carry a lot of common information. For example, if you feed the closing prices of the DJIA into a stock market model, it is unnecessary to also feed in the closing prices of the NASDAQ or S&P 500: though the three historical series are not identical, their movements are very strongly correlated. One method to remove the information that 'overlaps' between variables is to orthogonalize them via Principal Component Analysis. That way you are left only with inputs that share no common information.
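A bare-bones Principal Component Analysis can be sketched with NumPy as below (the two "index" series are synthetic stand-ins for correlated inputs like the DJIA and S&P 500, not real data):

```python
import numpy as np

def principal_components(X):
    """Orthogonalize correlated inputs: centre the data, take the
    eigenvectors of its covariance matrix, and project onto them."""
    Xc = X - X.mean(axis=0)
    cov = np.cov(Xc, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(cov)   # eigh returns ascending order
    order = np.argsort(eigvals)[::-1]        # largest variance first
    return Xc @ eigvecs[:, order], eigvals[order]

# Two strongly correlated synthetic index series.
rng = np.random.default_rng(0)
index_a = rng.normal(0, 1, 200).cumsum() + 100
index_b = 0.9 * index_a + rng.normal(0, 0.1, 200)   # near-duplicate of index_a
scores, variances = principal_components(np.column_stack([index_a, index_b]))
```

The first component soaks up nearly all the shared variance, so the second can often be dropped, which is exactly the dimension reduction described above.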

**Normalization and Standardization:** Normalization is the process of converting several input variables into a common format, usually by rescaling all data into the range 0 to 1 or –1 to +1. For example, if you have price data from the Japanese market in Yen, the U.S. market in US$ and the European market in Euros, it is better to normalize them all to the range 0 to 1. This makes them more digestible for the computer. An example of standardization is converting several streams of data into percentages. Again using the example of a model for financial markets, the daily change of the Nikkei comes in much bigger absolute values than that of the S&P 500, so converting these changes into percentages makes them much more digestible for the computer. It also gives you a better sense of the daily changes: dropping 100 points on the Nikkei is very different from dropping 100 points on the S&P 500.
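Min-max normalization to the 0-to-1 range is one short function in Python (the sample Nikkei and S&P 500 levels below are made-up round numbers for illustration):

```python
def min_max_normalize(values, low=0.0, high=1.0):
    """Rescale a series linearly so its minimum maps to `low`
    and its maximum maps to `high`."""
    lo, hi = min(values), max(values)
    return [low + (v - lo) * (high - low) / (hi - lo) for v in values]

nikkei = [38000, 38500, 37500, 39000]    # Yen
sp500 = [5000, 5050, 4950, 5100]         # US$
nikkei_n = min_max_normalize(nikkei)
sp500_n = min_max_normalize(sp500)
```

After rescaling, both series live on the same 0-to-1 scale regardless of currency or absolute level, so the model sees them on equal footing.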

**Detrending:** Opposite to the aim of smoothing and de-noising, which is to help you see the trend, de-trending removes the trend and helps you see the cycles. This technique is useful for cyclical data, such as when studying business cycles or sales of seasonal products. Data on commodities, raw materials and anything inherently prone to cycles also needs to be de-trended. De-trending is done by differencing the data. For example, a time series that goes 11064, 10981, 10967, 11001, 11089, 11120 … can be detrended to become

–83, –14, +34, +88, +31. This in turn can be further converted into % changes. De-trending also makes the data stationary. A stationary process has the property that its mean, variance and autocorrelation structure do not change over time. In order to make any kind of statistical inference from a single realization of a random process, stationarity of the process is often assumed. In practice strict stationarity is hard to find, but there are now mathematical techniques that can accommodate weak, or second-order, stationarity. One aspect of time series data to remember is that most of it is autocorrelated: the value at time T+1 depends on the value at time T.
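The differencing step above, and the follow-on conversion to percentage changes, is a two-liner in Python using the series from the example:

```python
series = [11064, 10981, 10967, 11001, 11089, 11120]

# First differences strip the level (trend) from the series.
diffs = [b - a for a, b in zip(series, series[1:])]
# → [-83, -14, 34, 88, 31]

# The same changes expressed as percentages of the previous value.
pct_changes = [100.0 * d / prev for d, prev in zip(diffs, series)]
```

Each percentage change divides a difference by the level it started from, which is what makes series at very different absolute levels comparable.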

**Compression:** One important technique in data pre-processing is compression. Even after differencing, data values can be converted into natural or base-10 logarithms. Compression is suitable for series with runaway, exponential-type growth, and the data can later be decompressed if necessary. Compression is also a type of normalization. One way compression is done is by passing the data through a transform function of a pre-determined shape. Neural Networks, for example, transform their inputs via a log-sigmoid or similarly shaped function; the curve transforms the input to varying degrees depending on which part of the curve a data value lies on.
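Log compression and its inverse take only a few lines with Python's standard `math` module (the geometric toy series is illustrative):

```python
import math

def log_compress(values):
    """Compress an exponentially growing series with base-10 logarithms;
    equal ratios become equal differences."""
    return [math.log10(v) for v in values]

def decompress(logged):
    """Invert the base-10 log transform."""
    return [10 ** v for v in logged]

growth = [1, 10, 100, 1000, 10000]        # runaway exponential growth
compressed = log_compress(growth)          # ≈ [0.0, 1.0, 2.0, 3.0, 4.0]
restored = decompress(compressed)
```

The exponential blow-up becomes a straight line in log space, which is far easier for a model to handle, and the transform is fully reversible.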

In much of econometric forecasting [ARIMA, ARMA, ARCH, GARCH etc.] a data stream is defined as comprising a Moving Average [MA] component, an Autocorrelation [AC] component and an Error/Random/Noise [E] component. Estimating the coefficients for MA and AC is simpler than for E. One solution is a continuous-updating-and-adjustment-through-time technique such as Kalman filtering and state-space analysis.
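To give a flavour of the continuous-updating idea, here is a minimal scalar Kalman filter in Python, assuming the simplest possible state-space model (a slowly drifting level observed with noise); the variance parameters are illustrative guesses, not fitted values:

```python
def kalman_1d(measurements, process_var=1e-4, meas_var=0.5):
    """Scalar Kalman filter for a slowly drifting level: at each step,
    blend the running estimate with the new measurement, weighting by
    their relative uncertainties."""
    estimate, error = measurements[0], 1.0
    estimates = [estimate]
    for z in measurements[1:]:
        error += process_var                  # predict: uncertainty grows
        gain = error / (error + meas_var)     # update: the Kalman gain
        estimate += gain * (z - estimate)     # pull estimate toward z
        error *= (1 - gain)                   # uncertainty shrinks
        estimates.append(estimate)
    return estimates

noisy = [10.2, 9.8, 10.1, 9.9, 10.3, 9.7, 10.0, 10.05]
filtered = kalman_1d(noisy)
```

Unlike a moving average, the filter re-weights old and new information at every step as its uncertainty changes, which is what makes it suited to the hard-to-estimate error component.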

After going through all this, I just want to say that our brain does all of the above without you being conscious of it, and does it without such explicit steps as elucidated here. Which goes to show how remarkable it is, and leaves open whether Ray Kurzweil's vision of a machine matching the power of the human brain by 2030 will materialize.
