Discussion thread for Correlation does not imply causation.

I think there is a problem here. I am using NumPy to compute this (using `numpy.cov(x,y)[0][1]` and `numpy.std(x)`), but the answer is off by a small margin.

Additionally, the terminology used here doesn't match what I'm seeing used elsewhere, namely in the NumPy documentation and Wikipedia. Where the term "variance" is used, "standard deviation" should be; the standard deviation is the square root of the variance.

Hey @kirktopode thank you for pointing out the errors in the description! Indeed it should be *standard deviation* and not *variance*.

Even worse, the formula we put up there isn't either of them; it's the total sum of squares, which is an even less useful quantity! The formula for the covariance needs to be fixed as well, since it's not normalized by the number of observations. You get the right answer if you calculate things exactly as the description states them, but we're not calculating the covariance and standard deviation anymore!

I will fix the description to actually use the proper definitions of the covariance and the standard deviation. Thanks again for pointing this stuff out! Sorry for the unnecessary confusion.

To answer why your answer was off by a small margin: it seems that `numpy.cov` uses an unbiased estimator for the covariance by default (`bias=False`), which normalizes things by `N-1` instead of `N`. I guess this is nice for getting an unbiased estimate of the covariance, but `numpy.std` normalizes by `N` by default, so combining the two inflates the correlation coefficient by a factor of `N/(N-1)`. The result is no longer guaranteed to be between -1 and 1 (and it gives the wrong answer). Statistics is a weird world!
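To make the mismatch concrete, here is a minimal sketch. The data values are made up purely for illustration; the only things being compared are the default normalizations of `numpy.cov` and `numpy.std`.

```python
import numpy as np

# Made-up sample data, only for illustration.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 1.0, 4.0, 3.0, 6.0])
n = len(x)

cov_unbiased = np.cov(x, y)[0, 1]           # default: divides by N-1
cov_biased = np.cov(x, y, bias=True)[0, 1]  # divides by N

# np.std divides by N by default (ddof=0).
denom = np.std(x) * np.std(y)

r_mixed = cov_unbiased / denom     # mismatched normalizations
r_consistent = cov_biased / denom  # both normalized by N

print("mixed:     ", r_mixed)
print("consistent:", r_consistent)
print("ratio:     ", r_mixed / r_consistent, "vs N/(N-1) =", n / (n - 1))
```

With this particular data the mixed version even comes out slightly above 1, which a correlation coefficient should never do.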

So I think to properly calculate the correlation coefficient using `numpy.cov` you want to use `numpy.cov(X, Y, bias=True)[0, 1]`. I should add this to the problem notes.

Alternatively, `numpy.corrcoef` directly calculates the correlation coefficient.
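As a rough check (again with made-up data), the sketch below compares three routes that should agree: `numpy.corrcoef`, the `bias=True` covariance with the default `numpy.std`, and the default covariance with `ddof=1` passed to `numpy.std`. Note that `numpy.corrcoef` returns a 2x2 matrix, so the coefficient is the off-diagonal entry.

```python
import numpy as np

# Made-up sample data, only for illustration.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 1.0, 4.0, 3.0, 6.0])

# Route 1: let NumPy do it; take the off-diagonal entry of the 2x2 matrix.
r_corrcoef = np.corrcoef(x, y)[0, 1]

# Route 2: normalize everything by N.
r_biased = np.cov(x, y, bias=True)[0, 1] / (np.std(x) * np.std(y))

# Route 3: normalize everything by N-1.
r_ddof1 = np.cov(x, y)[0, 1] / (np.std(x, ddof=1) * np.std(y, ddof=1))

print(r_corrcoef, r_biased, r_ddof1)
```

The point is only that the normalization has to be the same in the numerator and the denominator; whether it is `N` or `N-1` cancels out.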

Thanks for replying so quickly!