Discussion thread for Correlation does not imply causation.
I think there is a problem here. I am using NumPy to compute the correlation coefficient (using numpy.cov(x,y)[0][1] and numpy.std(x)), but the answer is off by a small margin.
Additionally, the terminology used here doesn’t match what I’m seeing used elsewhere – namely, in the Numpy documentation and Wikipedia. Where the term “variance” is used, “standard deviation” should be; the standard deviation is the square root of the variance.
Hey @kirktopode thank you for pointing out the errors in the description! Indeed it should be standard deviation and not variance.
Even worse, the formula we put up there isn’t either of them; it’s the total sum of squares, which is a less useful quantity here! The formula for the covariance needs to be fixed as well, since it’s not normalized by the number of observations. You get the right answer if you calculate things exactly as the description states them, but then we’re not calculating the covariance and standard deviation anymore!
I will fix the description to actually use the proper definitions of the covariance and the standard deviation. Thanks again for pointing this stuff out! Sorry for the unnecessary confusion.
To answer why your answer was off by a small margin: numpy.cov uses an unbiased estimator for the covariance by default (bias=False), which normalizes by N-1 instead of N. That is nice for getting an unbiased estimate of the covariance, but numpy.std normalizes by N by default, so mixing the two gives a result that is too large by a factor of N/(N-1) and is no longer guaranteed to lie between -1 and 1. Statistics is a weird world!
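Here is a rough sketch of that mismatch, in case it helps. The sample values for x and y are made up purely for illustration; the only thing assumed about NumPy is its default behaviour (numpy.cov dividing by N-1, numpy.std dividing by N).

```python
import numpy as np

# Made-up, strongly correlated sample data (illustration only).
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 4.1, 5.9, 8.2, 9.9])
n = len(x)

# Default numpy.cov divides by N-1; default numpy.std divides by N.
cov_default = np.cov(x, y)[0, 1]
mixed = cov_default / (np.std(x) * np.std(y))

# Reference value computed with consistent normalization.
r = np.corrcoef(x, y)[0, 1]

print("mixed normalization:", mixed)
print("numpy.corrcoef:     ", r)
print("ratio:              ", mixed / r, "vs N/(N-1) =", n / (n - 1))
```

With data this strongly correlated, the mixed version ends up above 1, which a correlation coefficient should never do.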
So I think to properly calculate the correlation coefficient using numpy.cov you want to use numpy.cov(X, Y, bias=True)[0, 1]. I should add this to the problem notes.
Alternatively, numpy.corrcoef directly calculates the correlation coefficient.
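For anyone who wants to check the options against each other, here is a minimal sketch. The data is made up for illustration, and the ddof=1 variant is an extra consistent combination I am adding for comparison rather than something from the problem description.

```python
import numpy as np

# Made-up sample data (illustration only).
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 4.1, 5.9, 8.2, 9.9])

# Option 1: normalize everything by N (bias=True matches numpy.std's default).
r_biased = np.cov(x, y, bias=True)[0, 1] / (np.std(x) * np.std(y))

# Option 2: normalize everything by N-1 (ddof=1 matches numpy.cov's default).
r_unbiased = np.cov(x, y)[0, 1] / (np.std(x, ddof=1) * np.std(y, ddof=1))

# Option 3: let NumPy compute the correlation coefficient directly.
r_direct = np.corrcoef(x, y)[0, 1]

print(r_biased, r_unbiased, r_direct)
```

As long as the covariance and both standard deviations use the same normalization, the factor cancels and all three values should agree up to floating-point error.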
Thanks for replying so quickly!