[問題] 使用R輔助基礎機率論教學
[問題類型]:
程式諮詢(我想用R 做某件事情,但是我不知道要怎麼用R 寫出來)
[軟體熟悉度]:
入門(寫過其他程式,只是對語法不熟悉)
[問題敘述]:
弱大數法則我還勉強能用丟銅板來"實驗"給學生看...
但是CLT要怎麼做呢= =
所以想用R來做電腦輔助教學...
而且我想示範:
1.不是n>=30就叫做大樣本。
所以除了常態分配和長得比較對稱的均勻分配以外,
其他都選取很極端的機率分佈,
結果n=30之下作Jarque Bera test都拒絕虛無假設。
2.期望值和變異數存在且有限才可適用CLT。
所以我故意選了Cauchy分配來示範。
(抽自柯西分配母體的樣本其平均仍然是柯西分配)
不過,我最後用ks.test的結果怪怪的,
是不是參數沒設定好?(難道沒辦法讓他自己抓常態分配的參數嗎= =)
(用SAS和SPSS都不用調參數....是不是用R作KS檢定一定要指定參數?)
[程式範例]:
#請記得先安裝TSA package
library(TSA)
#X~N(0,1)
set.seed(1)
par(mfrow=c(1,2))
popnorm<-rnorm(n=100000,mean=0, sd=1)
mean(popnorm)
var(popnorm)
plot(density(popnorm))
meannorm<-array(0,dim=c(1000))
for (i in 1:1000) {
meannorm[i]=mean(sample(popnorm,size=30,replace=TRUE))
}
mean(meannorm)
var(meannorm)
plot(density(meannorm))
ks.test(meannorm,"pnorm")
jarque.bera.test(meannorm)
結果:
> ks.test(meannorm,"pnorm")
One-sample Kolmogorov-Smirnov test
data: meannorm
D = 0.3406, p-value < 2.2e-16
alternative hypothesis: two-sided
> jarque.bera.test(meannorm)
Jarque Bera Test
data: meannorm
X-squared = 4.9045, df = 2, p-value = 0.0861
最後改用這個:
mean<-mean(meannorm)
sd<-sqrt(var(meannorm))
ks.test(meannorm,"pnorm",mean,sd)
結果:
One-sample Kolmogorov-Smirnov test
data: meannorm
D = 0.029, p-value = 0.3692
alternative hypothesis: two-sided
沒有違背預期的結果。
[關鍵字]:
電腦輔助教學, 機率, 中央極限定理
附註:以下試驗其他機率分佈
http://ideone.com/UkhaJO
為節省版面,貼到網路上供參考。
--
※ 發信站: 批踢踢實業坊(ptt.cc)
◆ From: 111.255.26.132
→
12/15 08:05, , 1F
12/15 08:05, 1F
→
12/15 08:07, , 2F
12/15 08:07, 2F
→
12/15 11:50, , 3F
12/15 11:50, 3F
→
12/15 11:50, , 4F
12/15 11:50, 4F
→
12/15 12:02, , 5F
12/15 12:02, 5F
mean<-mean(meannorm)
sd<-sqrt(var(meannorm))
↑這裡的SD=sqrt(樣本變異數)=sqrt(母體變異數/30)=母體標準差/sqrt(30)
ks.test(meannorm,"pnorm",mean,sd)
結果:
One-sample Kolmogorov-Smirnov test
data: meannorm
D = 0.029, p-value = 0.3692
alternative hypothesis: two-sided
這個部份的sd應該不用再除sqrt(30)
因為樣本的變異數已經是母體變異數除以30了。
> mean(popnorm)
[1] -0.002244083
> var(popnorm)
[1] 1.007059
> mean(meannorm)
[1] 0.002210398
> var(meannorm)
[1] 0.03560481
其中抽出樣本的變異數約為母體變異數/30
> var(popnorm)/30
[1] 0.03356863
所以使用CLT的時候,標準差不用再除以sqrt(30)囉...
主要的問題是,不曉得有沒有用KS檢定卻不需指定參數的方法?
類似SAS和SPSS的那種...
雖然用JB檢定是滿方便的,可是初等統計教科書大多沒寫這方法...
另外,抽樣的時候設定replace=TRUE應該沒問題吧?
我想說這樣比較能確保每個樣本都iid於母體分配。
---
程式又大幅度改寫:
#X~N(0,1)
set.seed(1)
par(mfrow=c(1,2))
popnorm<-rnorm(n=100000,mean=0, sd=1)
meanpopnorm<-mean(popnorm)
varpopnorm<-var(popnorm)
print(paste("The mean of the population is",round(meanpopnorm,5)))
print(paste("The variance of the population is", round(varpopnorm,5)))
plot(density(popnorm))
cltnorm<-function(n){
meannorm<-array(0,dim=c(1000))
for (i in 1:1000) {
meannorm[i]=mean(sample(popnorm,size=n,replace=TRUE))
}
print("----------------------------------------------------------")
print(paste("The mean of samples is",round(mean(meannorm),5)))
print(paste(", and the variance of samples is",round(var(meannorm),5)))
print(paste("According to CLT, the theoretical mean is",
round(meanpopnorm,5)))
print(paste(", and the theoretical variance of samples
is",round(varpopnorm/n,5)))
print("----------------------------------------------------------")
plot(density(meannorm))
jarque.bera.test(meannorm)
}
接下來重覆五次試驗(n=30):
> cltnorm(30)
[1] "----------------------------------------------------------"
[1] "The mean of samples is 0.00355"
[1] ", and the variance of samples is 0.03627"
[1] "According to CLT, the theoretical mean is -0.00224"
[1] ", and the theoretical variance of samples is 0.03357"
[1] "----------------------------------------------------------"
Jarque Bera Test
data: meannorm
X-squared = 4.4989, df = 2, p-value = 0.1055
> cltnorm(30)
[1] "----------------------------------------------------------"
[1] "The mean of samples is -0.00778"
[1] ", and the variance of samples is 0.03286"
[1] "According to CLT, the theoretical mean is -0.00224"
[1] ", and the theoretical variance of samples is 0.03357"
[1] "----------------------------------------------------------"
Jarque Bera Test
data: meannorm
X-squared = 0.4475, df = 2, p-value = 0.7995
>
> cltnorm(30)
[1] "----------------------------------------------------------"
[1] "The mean of samples is 0.00405"
[1] ", and the variance of samples is 0.03348"
[1] "According to CLT, the theoretical mean is -0.00224"
[1] ", and the theoretical variance of samples is 0.03357"
[1] "----------------------------------------------------------"
Jarque Bera Test
data: meannorm
X-squared = 3.2785, df = 2, p-value = 0.1941
> cltnorm(30)
[1] "----------------------------------------------------------"
[1] "The mean of samples is -0.0048"
[1] ", and the variance of samples is 0.03026"
[1] "According to CLT, the theoretical mean is -0.00224"
[1] ", and the theoretical variance of samples is 0.03357"
[1] "----------------------------------------------------------"
Jarque Bera Test
data: meannorm
X-squared = 3.2825, df = 2, p-value = 0.1937
> cltnorm(30)
[1] "----------------------------------------------------------"
[1] "The mean of samples is 5e-05"
[1] ", and the variance of samples is 0.03214"
[1] "According to CLT, the theoretical mean is -0.00224"
[1] ", and the theoretical variance of samples is 0.03357"
[1] "----------------------------------------------------------"
Jarque Bera Test
data: meannorm
X-squared = 10.1117, df = 2, p-value = 0.006372
最後一次試驗錯誤的拒絕虛無假設...
我該怎麼解釋呢= ="
---
剛這樣寫:
testing<-function(n, m){
set.seed(1)
popnorm<-rnorm(n=100000,mean=0, sd=1)
error<-array(0,dim=c(m))
for (i in 1:m){
meannorm<-array(0,dim=c(1000))
for (i in 1:1000) {
meannorm[i]=mean(sample(popnorm,size=n,replace=TRUE))
}
test<-jarque.bera.test(meannorm)
print(test)
p<-as.numeric(test$p.value)
print(p)
error[i]<-ifelse(p<0.05,1,0)
print(error[i])
}
return(list(matrix=error))
}
使用testing(30,50)測試,
可是傳回的矩陣很怪...
變成有1000個元素的一維陣列,
而且第五筆資料應該要存入1,卻反而是0。
※ 編輯: anovachen 來自: 111.255.24.52 (12/15 23:48)
→
12/16 08:49, , 6F
12/16 08:49, 6F
set.seed(1)
popnorm<-rnorm(n=100000,mean=0, sd=1)
error<-array(0,dim=c(30))
testing<-function(n){
meannorm<-array(0,dim=c(1000))
for (i in 1:1000) {
meannorm[i]=mean(sample(popnorm,size=n,replace=TRUE))
}
test<-jarque.bera.test(meannorm)
print(test)
p<-as.numeric(test$p.value)
print(p)
return(ifelse(p<0.05,1,0))
}
for (i in 1:30){
error[i]<-testing(30)
}
這樣寫就OK了。
不曉得上面寫法問題出在哪?? >"<
----
以下是某不願具名者提供的程式碼:
n <- 30
ret <- NULL
for(i in 1:100){
x <- rnorm(n)
x.hat <- mean(x) * sqrt(n) # CLT stat. goes to N(0, 1) in d.
ret[i] <- x.hat
}
ks.test(ret, "pnorm")
# Test 100 x.hat's against N(0, 1)
他指出我原本的寫法是類似bootstrap,在有限的樣本(N=100000)自助重抽樣,
以模擬從真正的母體裡抽樣的過程。
而他的寫法則比較貼近我想表達的。
依據他的寫法,我們可以如此改寫:
n <- 100
ret <- NULL
for(i in 1:100){
x <- runif(n,min=0,max=1)
x.hat <- mean(x)
ret[i] <- x.hat
}
hist(ret)
ks.test(ret, "pnorm", mean=1/2, sd = sqrt(1/12/n))
如此可驗證從U(0,1)中抽樣,n=100,則X-bar的分布
是否會漸近N(1/2, 1/1200)。
※ 編輯: anovachen 來自: 111.255.13.49 (12/19 00:47)
推
12/22 03:35, , 7F
12/22 03:35, 7F
→
12/22 03:36, , 8F
12/22 03:36, 8F
→
12/29 16:25, , 9F
12/29 16:25, 9F
R_Language 近期熱門文章
PTT數位生活區 即時熱門文章