Re: [Question] A sampling question

Board: R_Language  Author: celestialgod (攸藍)  Time: 2015/05/04 18:41  Pushes: 3 (3 push, 0 boo, 1 →)

4 comments, 3 participants, thread 2/2

※ Quoting ardodo (米蟲):
: Hello everyone on the board, I have a question I would like to ask.
: I have student data (10,000 records in total) from 22 departments of a university.
: I want to sample 30 students from each department for analysis. How should I do it?
: My idea: subset each department once, randomly draw 30 students and store them as an object, repeat 22 times,
: and finally rbind the 22 objects.
: But this is time-consuming and inefficient. Is there a faster way?

I generated a simple case to demonstrate.
Cutting the data with split and then merging the pieces is fast,
and the pipe operator (%>% from magrittr) was designed precisely to avoid storing too many temporary objects.

library(data.table)
library(dplyr)
library(magrittr)
nDepat = 22
nVar = 10
# test data: 10000 rows per department, 220000 rows in total
dat = replicate(nVar, rnorm(10000*nDepat)) %>% data.frame() %>%
  mutate(department = rep(LETTERS[1:nDepat], each = 10000)) %>% tbl_df()
nSubset = 30
dat2 = dat %>% split(.$department) %>%
  lapply(function(x){ x[sample(1:nrow(x), nSubset),]}) %>% do.call(rbind, .)
# %>% do.call(rbind, .) does the same job as %>% rbindlist(.)

# Another way by plyr
library(plyr)
dat3 = dat %>% plyr:::splitter_d(.(department)) %>%
  ldply(function(x) x[sample(1:nrow(x), nSubset),])

# third way to do
dat4 = dat %>% ddply(.(department), function(d) d[sample(1:nrow(d), nSubset),])

Here is a benchmark. Personally I prefer the third one because the code is concise.

library(rbenchmark)
benchmark(
  method1 = dat %>% split(.$department) %>%
    lapply(function(x) x[sample(1:nrow(x), nSubset),]) %>% do.call(rbind, .),
  method2 = dat %>% plyr:::splitter_d(.(department)) %>%
    ldply(function(x) x[sample(1:nrow(x), nSubset),]),
  method3 = dat %>% ddply(.(department), function(d){
    d[sample(1:nrow(d), nSubset),]}),
  replications = 30L,
  columns = c("test", "replications", "user.self", "sys.self", "elapsed", "relative"),
  order = "relative")
#      test replications user.self sys.self elapsed relative
# 2 method2           30      2.64     0.03    2.93    1.000
# 3 method3           30      2.84     0.02    2.96    1.010
# 1 method1           30      2.90     0.17    3.12    1.065

As for dplyr, I later realized that doing this with group_by is more cumbersome: you have to add a new variable first and then use filter, so I do not recommend it.

--
※ Origin: 批踢踢實業坊 (ptt.cc), From: 123.205.27.107
※ Article URL: https://www.ptt.cc/bbs/R_Language/M.1430736119.A.E54.html
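For reference, the group_by route mentioned in the post (add a helper variable, then filter on it) might look roughly like the sketch below. It is only one possible reading of that remark, not the author's code. The toy data frame and the names toy, n_subset and rand_rank are made up for illustration.

```r
library(dplyr)

set.seed(1)
# Hypothetical toy data: 3 departments with 100 rows each
toy <- data.frame(department = rep(c("A", "B", "C"), each = 100),
                  score = rnorm(300))
n_subset <- 30

sampled <- toy %>%
  group_by(department) %>%
  # helper variable: a random permutation of 1..n within each department
  mutate(rand_rank = sample(n())) %>%
  # keep the rows whose random rank falls in the first n_subset positions
  filter(rand_rank <= n_subset) %>%
  ungroup() %>%
  select(-rand_rank)
```

Each department ends up with n_subset rows as long as it has at least that many rows to begin with. A smaller department would simply keep all of its rows.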

→

obarisk

05/04 18:46, 1^F

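At first I thought splitter_d would be faster than split, but it turned out I had not read carefully. The documentation says: "This is basically a thin wrapper around split which evaluates the variables in the context of the data". So splitter_d should not be much faster, and ldply should be about as fast as lapply %>% do.call. After testing, the speeds are about the same, so it comes down to personal preference.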