9/30/2012

Is X related to Y?

As investors and speculators, we always want to forecast the future, so we look for correlations between the Xs and Y.  The Xs can be any kind of historical data: macroeconomic indicators, weather reports, fundamental analysis, technical indicators, and almost anything else you can think of.  Y is what we want to forecast, usually stock prices, commodity prices, or market indices.

So what we usually do is compute the "correlation" between X and Y, written corr(X,Y), for every candidate Xi: corr(Xi,Y).  If corr(Xi,Y) is close to 1 or -1, we conclude there is some kind of correlation between Xi and Y, assuming we have enough data points.  But what if corr(Xi,Y) is almost zero?  Does that mean there is no correlation at all, or no relationship at all?

When we examine all the corr(Xi,Y), we tend to pick the values closest to 1 or -1 and use those Xi for more advanced analysis.  We usually neglect the Xi whose corr(Xi,Y) is close to zero.  Is this approach valid?

Let's take a look at a simple example in which corr(Xi,Y) is close to zero, yet Xi is actually "correlated" with Y, or rather, the causation does exist.

Let's say Y=XOR(X1,X2), where X1 and X2 are independent random variables taking the values -1 and 1 with prob(-1)=prob(1)=0.5.  Then corr(X1,Y) and corr(X2,Y) are both zero, yet the relationship between (X1,X2) and Y clearly exists: Y is completely determined by the pair.
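
To see the numbers, here is a minimal simulation sketch of the XOR example.  The helper names pearson and xor_pm1, the sample size, and the seed are my own choices for illustration, and the sign convention for XOR on -1/1 values (1 when the inputs differ, -1 otherwise) is an assumption; the opposite convention only flips the sign of the last correlation.

```python
import math
import random

def pearson(xs, ys):
    """Sample Pearson correlation of two equal-length sequences."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(vx * vy)

def xor_pm1(a, b):
    """XOR on the -1/1 encoding (assumed convention): 1 if inputs differ, else -1."""
    return 1 if a != b else -1

rng = random.Random(0)  # fixed seed so the run is repeatable
n = 100_000
x1 = [rng.choice((-1, 1)) for _ in range(n)]
x2 = [rng.choice((-1, 1)) for _ in range(n)]
y = [xor_pm1(a, b) for a, b in zip(x1, x2)]

# Individually, each input looks useless for forecasting Y.
print("corr(X1, Y)    =", round(pearson(x1, y), 3))
print("corr(X2, Y)    =", round(pearson(x2, y), 3))

# Jointly, the pair determines Y completely: under this convention Y = -(X1*X2).
prod = [a * b for a, b in zip(x1, x2)]
print("corr(X1*X2, Y) =", round(pearson(prod, y), 3))
```

The first two values should come out near zero (sampling noise only), while the product X1*X2 is perfectly anti-correlated with Y under this sign convention.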

Most people who know XOR know how this trick works, but it's hard to imagine how it would show up in real-life data analysis.

Here is how something weird could happen.  Suppose we collect all the Xs we can find, call them X, to forecast Y.  We model Y as f(X), where f could be any kind of function, and we calculate corr(f(X),Y).  We want to find the best f, the one that brings corr(f(X),Y) close to 1, so that we can use f(X) for forecasting.  But X might not be complete: there might be some Xu that was never observed, or that was ignored because corr(Xu,Y) is close to zero.  And corr(fu(X),Y) might also be close to zero for some function fu.  Yet it could be that Y=XOR(Xu,fu(X)).  If we ignore Xu or fu, there is no chance of finding this relationship.

This means we have to pay attention to those Xu and fu even when corr(Xu,Y) or corr(fu(X),Y) is close to zero.  So having no correlation (individually) could still mean having correlation (jointly)?
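
One way to act on this, sketched here under strong assumptions, is to screen pairs of candidates jointly instead of one at a time.  Everything in the sketch is hypothetical: the feature names Xa, Xb, Xc and Xu are made up, the data is synthetic, and the pairwise product is just one possible interaction term that happens to match XOR for -1/1 inputs.  It is not a general recipe for real market data.

```python
import itertools
import math
import random

def pearson(xs, ys):
    """Sample Pearson correlation of two equal-length sequences."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(vx * vy)

rng = random.Random(1)  # fixed seed so the run is repeatable
n = 20_000

# Hypothetical candidate features, each an independent fair -1/1 coin.
features = {
    name: [rng.choice((-1, 1)) for _ in range(n)]
    for name in ("Xa", "Xb", "Xu", "Xc")
}

# Synthetic target: XOR of Xb and Xu in the -1/1 encoding (1 if they differ).
y = [-b * u for b, u in zip(features["Xb"], features["Xu"])]

# Step 1: the usual one-at-a-time screening.
single = {name: pearson(col, y) for name, col in features.items()}

# Step 2: screen every pair through its product, a stand-in for XOR.
pairs = {}
for (na, ca), (nb, cb) in itertools.combinations(features.items(), 2):
    product = [a * b for a, b in zip(ca, cb)]
    pairs[(na, nb)] = pearson(product, y)

best_pair = max(pairs, key=lambda k: abs(pairs[k]))

for name, c in single.items():
    print(f"corr({name}, Y) = {c:+.3f}")
print("pairs checked:", len(pairs))
print("strongest pair:", best_pair, f"corr = {pairs[best_pair]:+.3f}")
```

With only 4 candidates there are already 6 pairs to check, and the count grows quickly once triples and more general fu are allowed.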

How many Xu and fu are there to check?

