ピアソン相関によるスコア計算
ピアソン相関係数の計算式は

です、値は-1から1の間をとり、完全に相関する場合は1、相関がない場合0、逆相関の場合は-1となる模様
Pythonのコード
映画の評価データは前回と同じものを利用していますのでそちらを参照してください
#!/usr/bin/python # -*- coding: utf-8 -*- from MovieEval import critics from math import sqrt def sim_pearson(prefs, person1, person2): si = {} for item in prefs[person1]: if item in prefs[person2]: si[item] = 1 n = len(si) if n == 0: return 0 sum1 = sum([prefs[person1][it] for it in si]) sum2 = sum([prefs[person2][it] for it in si]) sum1Sq = sum([pow(prefs[person1][it], 2) for it in si]) sum2Sq = sum([pow(prefs[person2][it], 2) for it in si]) pSum = sum([prefs[person1][it] * prefs[person2][it] for it in si]) num = pSum - (sum1 * sum2 / n) den = sqrt((sum1Sq - pow(sum1, 2) / n) * (sum2Sq - pow(sum2, 2) / n)) if den == 0: return 0 r = num / den return r def calcScore(name, prefs): print("%s:" % name) for (nm, val) in prefs.items(): print(" %20s %20f" % (nm, sim_pearson(critics, name, nm))) print("--------------------------------------------------") if __name__ == '__main__': for (name, ev) in critics.items(): calcScore(name, critics)これを実行すると
cuomo@karky7 ~ $ python pearson.py Jack Matthews: Jack Matthews 1.000000 Mick LaSalle 0.211289 Claudia Puig 0.028571 Lisa Rose 0.490990 Toby 0.662849 Gene Seymour 0.963796 Michael Phillips 0.134840 -------------------------------------------------- Mick LaSalle: Jack Matthews 0.211289 Mick LaSalle 1.000000 Claudia Puig 0.566947 Lisa Rose 0.585491 Toby 0.924473 Gene Seymour 0.411765 Michael Phillips -0.258199 -------------------------------------------------- Claudia Puig: Jack Matthews 0.028571 Mick LaSalle 0.566947 Claudia Puig 1.000000 Lisa Rose 0.883883 Toby 0.893405 Gene Seymour 0.314970 Michael Phillips 1.000000 -------------------------------------------------- Lisa Rose: Jack Matthews 0.490990 Mick LaSalle 0.585491 Claudia Puig 0.883883 Lisa Rose 1.000000 Toby 0.991241 Gene Seymour 0.315264 Michael Phillips 0.774597 -------------------------------------------------- Toby: Jack Matthews 0.662849 Mick LaSalle 0.924473 Claudia Puig 0.893405 Lisa Rose 0.991241 Toby 1.000000 Gene Seymour 0.381246 Michael Phillips -1.000000 -------------------------------------------------- Gene Seymour: Jack Matthews 0.963796 Mick LaSalle 0.411765 Claudia Puig 0.314970 Lisa Rose 0.315264 Toby 0.381246 Gene Seymour 1.000000 Michael Phillips 0.204598 -------------------------------------------------- Michael Phillips: Jack Matthews 0.134840 Mick LaSalle -0.258199 Claudia Puig 1.000000 Lisa Rose 0.774597 Toby -1.000000 Gene Seymour 0.204598 Michael Phillips 1.000000 -------------------------------------------------- cuomo@karky7 ~ $
Haskellでやると
評価データは前回のものを利用していますimport qualified Data.Map as M import MovieEval import Control.Monad (forM_) import Text.Printf sim_distance :: M.Map String [Eval] -> String -> String -> Double sim_distance perfs person1 = \person2 -> sim_pearson p1 (M.lookup person2 perfs) where p1 = M.lookup person1 perfs sim_pearson :: Maybe [Eval] -> Maybe [Eval] -> Double sim_pearson (Just p1) (Just p2) = if den /= 0 then num / den else 0 where ev = map (\(v, w) -> (movEv v, movEv w)) [(x, y) | x <- p1, y <- p2, movName x == movName y] n = length ev xs = map fst ev ys = map snd ev sumx = sum xs sumy = sum ys sumxSq = sum[x*x | x <- xs] sumySq = sum[y*y | y <- ys] pSum = sum[x*y | (x, y) <- ev] num = pSum - (sumx * sumy / realToFrac n) den = sqrt ((sumxSq-(sumx*sumx) / realToFrac n)*(sumySq-(sumy*sumy) / realToFrac n)) sim_pearson _ _ = 0.0 calcScore :: String -> [String] -> (String -> String -> Double) -> IO() calcScore name names f = do putStrLn $ name ++ ":" forM_ names $ \nm -> do printf " %20s %20f\n" nm (f name nm) putStrLn "--------------------------------------------------\n" main :: IO() main = do let perfs = getCritics names = M.keys perfs mapM_ (\name -> calcScore name names (sim_distance perfs)) names
cuomo@karky7 ~ $ runhaskell peason.hs Claudia Puig: Claudia Puig 1.0 Gene Seymour 0.31497039417435607 Jack Matthews 0.02857142857142857 Lisa Rose 0.883883476483186 Michael Phillips 1.0 Mick LaSalle 0.5669467095138411 Toby 0.8934051474415647 -------------------------------------------------- Gene Seymour: Claudia Puig 0.31497039417435607 Gene Seymour 1.0 Jack Matthews 0.963795681875635 Lisa Rose 0.31526414437773115 Michael Phillips 0.20459830184114206 Mick LaSalle 0.41176470588235276 Toby 0.38124642583151164 -------------------------------------------------- Jack Matthews: Claudia Puig 0.02857142857142857 Gene Seymour 0.963795681875635 Jack Matthews 1.0 Lisa Rose 0.49099025303098176 Michael Phillips 0.13483997249264842 Mick LaSalle 0.21128856368212925 Toby 0.66284898035987 -------------------------------------------------- Lisa Rose: Claudia Puig 0.883883476483186 Gene Seymour 0.31526414437773115 Jack Matthews 0.49099025303098176 Lisa Rose 1.0 Michael Phillips 0.7745966692414834 Mick LaSalle 0.5854905538443589 Toby 0.9912407071619299 -------------------------------------------------- Michael Phillips: Claudia Puig 1.0 Gene Seymour 0.20459830184114206 Jack Matthews 0.13483997249264842 Lisa Rose 0.7745966692414834 Michael Phillips 1.0 Mick LaSalle -0.2581988897471611 Toby -1.0 -------------------------------------------------- Mick LaSalle: Claudia Puig 0.5669467095138411 Gene Seymour 0.41176470588235276 Jack Matthews 0.21128856368212925 Lisa Rose 0.5854905538443589 Michael Phillips -0.2581988897471611 Mick LaSalle 1.0 Toby 0.9244734516419049 -------------------------------------------------- Toby: Claudia Puig 0.8934051474415647 Gene Seymour 0.38124642583151164 Jack Matthews 0.66284898035987 Lisa Rose 0.9912407071619299 Michael Phillips -1.0 Mick LaSalle 0.9244734516419049 Toby 1.0 -------------------------------------------------- cuomo@karky7 ~ $こんな感じで2人の間の類似性が簡単な計算で求められたりするとなかなかおもしろいですね。チョット気になるのが相関値が同一人物でないのに1.0で出てるのが、計算の誤差なんでしょうかね?....pythonでやってもhaskellでやっても1.0なのでいいとは思うのですが。
後で得意の電卓でやってみます 笑...
0 件のコメント:
コメントを投稿