有意差とは何か、シミュレーションで調べる

はじめに

いろんな分野で統計学に基づいた有意差検定が使われています。有意差検定の方法は、さまざまな書籍、たくさんのネット記事に紹介されています。また、RやPythonなどのツールを使えば、手軽に有意差検定をすることができます。

しかし、手持ちのデータに有意差検定を行って「有意差あり」「有意差なし」が分かったとしても、それがいったいどういうことなのか、直感的に理解するのは難しいかも知れません。

この記事では、Pythonを使ってさまざまなデータをシミュレーションで作り、さらに有意差検定を行ってみて、どういうときに有意差があり、どういうときに有意差が認められないのか、実験的に調べてみます。

Note

今回は、データの母集団が同一標準偏差の正規分布に従う場合に着目します。そして、母集団から得られるサンプルの平均値に対する有意差検定を掘り下げます。

なお、本記事のPython スクリプトの動作確認は、MacのPython 2.7.15 (Anaconda)で行いました。ライブラリにnumpy, scipy, matplotlibを使っています。

例題

まず、有意差検定をする以下の例題を考えます。

問1

同じ製品を作る工場A, Bにて、それぞれ10個の製品サンプルを採取して重さを測ったところ、以下の表のとおりになった。Aの製品の重さの平均と、Bの製品の重さの平均には、有意な差があるのだろうか？

A	B
99.46 g	99.29 g
100.48 g	99.17 g
103.25 g	97.63 g
98.98 g	98.14 g
99.42 g	99.14 g
100.12 g	100.56 g
100.30 g	98.73 g
100.52 g	100.12 g
100.00 g	98.94 g
101.34 g	100.33 g

問2

（問題文は問1と同じで、表の値が違います）

A	B
99.80 g	102.01 g
100.48 g	99.70 g
99.48 g	101.27 g
99.44 g	101.23 g
101.97 g	102.35 g
101.39 g	101.89 g
100.09 g	99.00 g
100.28 g	100.63 g
100.77 g	102.67 g
101.25 g	100.56 g

回答

それでは、有意差検定をやってみます。ここではいくつかの仮定を置きます。

A, Bともに、重さの母集団は正規分布に従うこととします。
A, Bの母集団の標準偏差は同一とします。

検定方法には、t検定のうち「独立2群の平均値の差の検定」を使います。有意水準は5%とします。

問1の回答

Pythonのscipyライブラリを使うと、手軽にt検定を行うことができます。

problem1.py

from scipy import stats
import numpy as np
# values of A
a = np.array([99.46, 100.48, 103.25, 98.98, 99.42, 100.12, 100.30, 100.52, 100.00, 101.34])
# values of B
b = np.array([99.29, 99.17, 97.63, 98.14, 99.14, 100.56, 98.73, 100.12, 98.94, 100.33])
# obtain p-value
p = stats.ttest_ind(a, b).pvalue
print "p = {}".format(p)

このスクリプトを実行すると、以下のp値が出てきます。

$ python problem1.py
p = 0.0248874728426

p値が有意水準(0.05)未満ならば、「有意差あり」と解釈されます。よって、AとBの平均値には、有意な差があったということになります。

問2の回答

さきほどと同じスクリプトが使えますが、データは問2の値に変えます。

problem2.py

from scipy import stats
import numpy as np
# values of A
a = np.array([99.80, 100.48, 99.48, 99.44, 101.97, 101.39, 100.09, 100.28, 100.77, 101.25])
# values of B
b = np.array([102.01, 99.70, 101.27, 101.23, 102.35, 101.89, 99.00, 100.63, 102.67, 100.56])
# obtain p-value
p = stats.ttest_ind(a, b).pvalue
print "p = {}".format(p)

スクリプトを実行すると、以下のp値が出てきます。

$ python problem2.py
p = 0.181649846941

p値が有意水準(0.05)以上なので、「有意差なし」と解釈されます。よって、AとBの平均値には、有意な差は認められなかったということになります。

実際は・・・

例題の表の値は、正規分布からサンプリングして作ったデータです。

問1の表の値は、A, Bともに、平均100, 標準偏差1の正規分布からサンプリングしたものです。つまり、本来はAとBの母集団には差がないのに、有意差検定では「有意差あり」と判定されています。この現象は「第1種過誤」と呼ばれます。

問2の表の値は、Aは平均100、Bは平均101の正規分布からサンプリングしたものです（標準偏差はともに1）。つまり、本来はAとBの母集団には差があるにもかかわらず、有意差検定では「有意差なし」と判定されています。この現象は「第2種過誤」と呼ばれます。

それぞれの表を作ったスクリプトも紹介しておきます。

問1の表の作成

make_problem1.py

import numpy as np
from scipy import stats

def main():
    # fix the random seed to reproduce same results 
    np.random.seed(12345)
    # set the level of significance
    sig_level = 0.05

    while True:
        # get 10 samples from a gaussian distribution (average=100, standard deviation=1),
        # and regard them as A
        a = np.random.normal(loc=100, size=(10))
        # get 10 samples from the same gaussian distribution,
        # and regard them as B
        b = np.random.normal(loc=100, size=(10))

        # do t-test between A and B
        res = stats.ttest_ind(a, b)
        # if the test result indicates "significant difference" (i.e. type 1 error), then ends.
        if res.pvalue < sig_level:
            print "type-1 error happened."
            # print sampled values of A and B
            print "       A       B"
            for i in xrange(10):
                print " %7.2f %7.2f" % (a[i], b[i])
            break

if __name__ == '__main__':
    main()

実行結果

$ python make_problem1.py
type-1 error happened.
       A       B
   99.46   99.29
  100.48   99.17
  103.25   97.63
   98.98   98.14
   99.42   99.14
  100.12  100.56
  100.30   98.73
  100.52  100.12
  100.00   98.94
  101.34  100.33

問2の表の作成

make_problem2.py

import numpy as np
from scipy import stats

def main():
    # fix the random seed to reproduce same results
    np.random.seed(12345)
    # set the level of significance
    sig_level = 0.05

    while True:
        # get 10 samples from a gaussian distribution (average=100, standard deviation=1),
        # and regard them as A
        a = np.random.normal(loc=100, size=(10))
        # get 10 samples from a gaussian distribution (average=101, standard deviation=1),
        # and regard them as B
        b = np.random.normal(loc=101, size=(10))

        # do t-test between A and B
        res = stats.ttest_ind(a, b)
        # if the test result indicates "no significant difference" (i.e. type 2 error), then ends.
        if res.pvalue >= sig_level:
            print "type-2 error happened."
            # print sampled values of A and B
            print "       A       B"
            for i in xrange(10):
                print " %7.2f %7.2f" % (a[i], b[i])
            break

if __name__ == '__main__':
    main()

実行結果

$ python make_problem2.py
type-2 error happened.
       A       B
   99.80  102.01
  100.48   99.70
   99.48  101.27
   99.44  101.23
  101.97  102.35
  101.39  101.89
  100.09   99.00
  100.28  100.63
  100.77  102.67
  101.25  100.56

どういうときに「有意差あり」「有意差なし」になるのか？

例題のように、実際には母集団に差がなくても「有意差あり」になるケース、実際には母集団に差があっても「有意差なし」になるケースがあることがわかりました。それでは、実際にどんなときに「有意差あり」になって、どんなときに「有意差なし」になるのでしょう？

これを調べるため、次の実験をします。

実験方法

2個の正規分布A, Bから、それぞれ同個数のサンプルを採取します。

そして、Aのサンプルの平均値、Bのサンプルの平均値の間に、有意差があるかを検定します。

検定方法は、例題と同じく、t検定のなかの「独立2群の平均値の差の検定」を使います。有意水準は5%とします。

実験条件

Aの母集団、すなわちAのサンプルを採取する正規分布は、平均値0、標準偏差1とします。
Bの母集団、すなわちBのサンプルを採取する正規分布には、いろんな平均値のものを使います（0, 0.1, 1, 10）。標準偏差はつねと1にします。
AとBのサンプル数もいろいろと変えてみます（3〜10,000）。ただし、AとBのサンプル数はつねに同一とします。

もし、Bの母集団の平均値が0であれば、それはAの母集団と同じですので、採取されたサンプルに対しても「有意差なし」となることが期待されます。一方、Bの母集団の平均値が0以外であれば、Aの母集団と異なりますので、「有意差あり」となることが期待されます。

母集団となる正規分布を見てみる

まず、標準偏差が1で、平均値が0〜10の正規分布がどのような形になるか、下の図で見てみましょう。

f:id:uenewsar:20181211225827p:plain

凡例"ave"が正規分布の平均値を示します。

平均0の正規分布（青）と平均10の正規分布（赤）は、山の位置が大きく異なります。そのため、これらの正規分布から採取されるサンプルの値にも大きな差が生まれますので、有意差は簡単に出そうです。

一方、平均0の正規分布（青）と平均0.1の正規分布（オレンジ）は、山の位置がほとんど重なっています。そのため、これらの正規分布から採取されるサンプルの値にも、ほとんど差がないでしょう。よって、有意差はあまり出そうにありません。

「有意差あり」になる確率を見てみる

それでは、実験をしてみます。この実験では、AとBのサンプルの平均値に有意差が認められる確率を求めます。

それぞれの実験条件において、「Aからサンプルを採取する → Bからサンプルを採取する → 有意差検定する」というプロセスを5,000回繰り返して、そのうち何回が「有意差あり」だったかをカウントします。

そして、この「有意差あり」の回数を5,000で割ることで、AとBの平均値に有意差が認められる確率を求めます。

下の図が、その結果です。

f:id:uenewsar:20181212123341p:plain

縦軸は、有意差が認められた確率です。横軸が、A, Bそれぞれのサンプル数です。それぞれの線は、異なるBの母集団の平均値に対応しています。すなわち、それぞれの線で、AとBの母集団の平均値の差が異なります（凡例"diff"）。

最初に、A, Bの母集団の平均値が同一の場合を見ます（青）。期待としては、AとBの母集団は同一ですので、「有意差なし」となってほしいです。しかし、実際には、有意水準5%の確率で「有意差あり」という間違った結果が得られることが分かります（第1種過誤）。この第1種過誤が起こる確率は、サンプル数が多くなっても変わりません。どんなに多くのサンプルを集めようとも、設定した有意水準の確率で、間違った検定結果が得られる可能性があることを示しています。

次に、A, Bの母集団の平均値が異なる場合を見ます（オレンジ、緑、赤）。平均値の差が10と大きい時（赤）、サンプル数によらず、ほぼ1の確率で「有意差あり」と判定されています。しかし、平均値の差が小さくなると、サンプル数が少ない時には「有意差あり」と判定される確率が低くなっていきます。すなわち、母集団に差があるのに「有意差なし」と判定されていますので、第2種過誤が発生している状況となります。

とくに、平均値の差が0.1のときには（オレンジ）、サンプル数を相当集めないと、なかなか「有意差あり」になりません。サンプル数を10,000個集めて、ようやくほぼ1の確率で「有意差あり」という判定となります。

このグラフを作るのに使ったPython スクリプトは以下になります。グラフが出るまで少し時間がかかります（2.3GHz Intel Core i5で2分）。

prob_of_error.py

import numpy as np
from scipy import stats
import matplotlib.pyplot as plt

def calc(num_sample, diff, sig_level):
    """
    calculate the ratio when significant differences are observed between two sets of samples.
    @param num_sample the number of samples to be picked up from a distribution.
    @param diff the difference of averages of two gaussian distributions.
    @param sig_level the level of significance.
    @return the ratio of cases when significant differences are observed between the avagages of two sets of samples.
    """

    # do 10000 trials to obtain a trustable ratio when significant differences are observed.
    num_trial = 10000
    num_significant = 0

    for i in xrange(num_trial):
        # get samples from a gaussian distribution (average=0, stdev=1),
        # and regard them as A.
        a = np.random.normal(loc=0, size=(num_sample))
        # get samples from a gaussian distribution (average=diff, stdev=1),
        # and regard them as B.
        b = np.random.normal(loc=diff, size=(num_sample))
        # do t-test between A and B.
        res = stats.ttest_ind(a, b)
        # if p-value is lower than the level of significance,
        # it means the significant difference was observed between A and B.
        # so increments the counter.
        if res.pvalue < sig_level:
            num_significant += 1

    # calculate the ratio when significant differences are observed.
    sig_ratio = float(num_significant)/num_trial

    return sig_ratio

def main():

    # fix the random seed to reproduce same results always.
    np.random.seed(12345)
    # differences of average values of distributions.
    diff_list = [0, 0.1, 1, 10]
    # numbers of samples to pick up from each distribution.
    num_sample_list = [3, 10, 30, 100, 300, 1000, 3000, 10000]

    # level of significance.
    sig_level = 0.05
    # initialize dictionary to store results.
    res = {}

    # for debug
    num_iter = len(diff_list) * len(num_sample_list)
    counter = 0

    for e_diff in diff_list:
        # initialize dictionary to store results.
        res[e_diff] = []

        for e_num_sample in num_sample_list:

            # calculate the ratio when the significant difference is observed
            # given the difference of averages of two distributions and the number of samples.
            sig_ratio = calc(e_num_sample, e_diff, sig_level)
            print "{}/{} sig_level={}, diff={}, num_sample={}, sig_ratio={}".format(counter+1, num_iter, sig_level, e_diff, e_num_sample, sig_ratio)
            
            # store the ratio.
            res[e_diff].append(sig_ratio)

            counter += 1

    # plot.
    plt.xscale("log")
    for e_diff in sorted(res.keys()):
        plt.plot(num_sample_list, res[e_diff], label="diff={}".format(e_diff), marker='o')
    plt.xlabel("number of samples [-]")
    plt.xlim((2, 30000))
    plt.ylabel("ratio when significant difference is observed [-]")
    plt.legend()
    plt.show()

if __name__ == '__main__':
    main()

実行結果

$ python prob_of_error.py
1/32 sig_level=0.05, diff=0, num_sample=3, sig_ratio=0.0501
2/32 sig_level=0.05, diff=0, num_sample=10, sig_ratio=0.0533
3/32 sig_level=0.05, diff=0, num_sample=30, sig_ratio=0.0535
        ...
31/32 sig_level=0.05, diff=10, num_sample=3000, sig_ratio=1.0
32/32 sig_level=0.05, diff=10, num_sample=10000, sig_ratio=1.0
(最後にグラフが表示される)

まとめ

この記事では、有意差がどういうときに認められ、どういうときに認められないかを直感的に理解するため、いろんなデータに対して有意差検定を行いました。今回は、データの母集団が同一標準偏差の正規分布に従い、平均値のみが異なる場合に着目し、母集団から得られるサンプルの平均値に対する有意差検定を掘り下げました。

その結果、以下のことが分かりました。

2群のサンプルが同一の母集団から採取されたとしても、有意水準相当の確率で、「有意差が認められる」ことがある。
2群のサンプルが異なる母集団から採取されたとしても、サンプル数が少ないと、「有意差が認めらない」ケースが増える。とくに母集団の平均値の差が小さい場合、サンプル数を相当増やさないと、有意差が出ない。