Difference between correlation and independence; k-means clustering pros and cons
Anonymous
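Correlation: the covariance is E[XY] - E[X]E[Y], and correlation is that divided by the standard deviations of X and Y. It can be zero even if the variables are not independent; you can usually set up a tricky pair of random variables that satisfies this. Independence: p(x,y) = p(x)p(y). This automatically implies zero covariance (plug it in and you get E[XY] = E[X]E[Y]).

K-means pros: good when you know the number of distinct clusters and there isn't too much overlap between them. Assigning a point at run time is pretty fast, you just compare it to the centroids: O(num_means * num_dimensions) per point. Interpretable, and variants such as k-medoids can use custom distance functions.

K-means cons: needs a meaningful distance function, and it is hard when features are on differing magnitudes (scale them first). It has to be trained, and training is always an approximation because the optimal solution is NP-hard. The standard (Euclidean) algorithm does converge, but only to a local optimum, so bad initial points can make the clusters bad. It is hard to tell how many clusters is sufficient, and it cannot model complex cluster shapes (think clusters of concentric rings).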
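To make the "zero covariance but not independent" point concrete, here is a minimal sketch. The specific choice of X uniform on {-1, 0, 1} with Y = X^2 is my own illustration, not something from the question; exact fractions are used so there is no floating-point noise.

```python
from fractions import Fraction

# Assumed toy example: X is uniform on {-1, 0, 1} and Y = X**2,
# so Y is completely determined by X (clearly not independent).
support = [-1, 0, 1]
p = Fraction(1, 3)  # probability of each value of X

e_x = sum(p * x for x in support)            # E[X]
e_y = sum(p * x**2 for x in support)         # E[Y] = E[X^2]
e_xy = sum(p * x * x**2 for x in support)    # E[XY] = E[X^3]
cov = e_xy - e_x * e_y                       # E[XY] - E[X]E[Y]

# Independence would require P(X=0, Y=0) == P(X=0) * P(Y=0).
p_joint = p        # X = 0 forces Y = 0, so the joint probability is 1/3
p_product = p * p  # P(X=0) = 1/3 and P(Y=0) = 1/3

print(cov)                   # 0
print(p_joint == p_product)  # False
```

The covariance is exactly zero, yet the joint probability (1/3) does not factor into the product of the marginals (1/9), so the variables are uncorrelated but dependent.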