Re: [討論] 每行抓取特定值
※ 引述《fifish89 (OMG)》之銘言:
: [問題類型]:
: 想針對每行判別是否有包含某特定值
: [軟體熟悉度]:
: 使用者(已經有用R 做過不少作品)
: [問題敘述]:
: 資料格式如下:
: ID D1 D2 D3 D4....D50
: A 123 23 ......... 55
: B 24 005
: C 504 . . 002
: D 002 . . .
: . . . ...
: H . . 002 ...
: . . . ...
: XX 410 . ...
: ============================
: CODE1<-有指定某些數字一
: CODE2<-有指定某些數字二
: .
: .
: .
: CODE15<-有指定某些數字十五
: 這是醫院資料,
: 每個人若最多有50個診斷欄位(D1-->D50),
: 而CODE1...CODE15是每個疾病的指定疾病碼,
: 想要判斷每個人是否有罹患這些疾病。
: EX:
: 假設CODE1=002(中風(stroke)疾病碼),
: 我就會去判斷每個診斷欄位中(D1-->D50)是否有002這個疾病碼,
: ID=C,在D4中有抓到中風碼
: ID=D,在D1中有抓到中風碼
: ID=H,在D3中有抓到中風碼
: 希望輸出結果為
: ID stroke ....
: A 0
: B 0
: C 1
: D 1
: . .
: . .
: H 1
: . .
: XX .
程式好讀版:http://pastebin.com/MN73bPPH
library(data.table)
library(tidyr)
library(dplyr)
library(magrittr)
N = 1e5
dat = data.table(ID = paste0("P", 1:N))
for (i in 1:50)
eval(parse(text = paste0("dat %<>% mutate(D", i,
"=paste0(sample(LETTERS, N, TRUE),",
"sample(as.character(1:100), N, TRUE)))")))
# single search
code1 = "D5"
st = proc.time()
a = dat %>% gather(D, code, -ID) %>% group_by(ID) %>%
summarise(stroke = any(code == code1)) %>% distinct()
proc.time() - st
# user system elapsed
# 0.42 0.05 0.47
# multiple search
codes_search = c("D5", "E55", "Z2", "A96")
st = proc.time()
a = dat %>% gather(D, code, -ID) %>% group_by(ID)
for (i in 1:length(codes_search))
eval(parse(text = paste0("a %<>% mutate(codes_search_", i,
"=codes_search[", i, "])")))
b = a %>% group_by(ID) %>% summarise_(.dots = paste0(
"any(code == codes_search_", 1:length(codes_search),")")) %>%
setnames(c("ID", paste0("stroke", 1:length(codes_search))))
proc.time() - st
# user system elapsed
# 2.74 0.26 3.01
## list search
diseases = vector('list', 15)
for (i in 1:length(diseases))
diseases[[i]] = paste0(sample(LETTERS, i, TRUE),
sample(1:100, i, TRUE))
st = proc.time()
a = dat %>% gather(D, code, -ID) %>% group_by(ID)
out_list = llply(1:length(diseases), function(i){
a %>% setkey(ID) %>% group_by(ID) %>%
summarise(stroke = any(code %in% diseases[[i]])) %>%
setnames("stroke", paste0("stroke_", i))
})
out = Reduce(function(x, y) merge(x, y), out_list)
proc.time() - st
# user system elapsed
# 8.88 0.36 9.23
## faster way
st = proc.time()
a = dat %>% gather(D, code, -ID) %>% group_by(ID)
eval(parse(text = paste("a %<>% summarise(", paste0("stroke_",
1:length(diseases) ,"=any(diseases[[", 1:length(diseases),
"]] %in% code)", collapse = ","), ")")))
proc.time() - st
# user system elapsed
# 5.79 0.05 5.85
## the fastest way
st = proc.time()
out2 = vector('list', length(diseases))
for (i in 1:length(diseases))
out2[[i]] = rowMeans(sweep(as.matrix(dat[,2:51, with = FALSE]), 2,
1:(ncol(dat)-1), function(x, y){matrix(x %in% diseases[[i]], nrow(dat))})) > 0
out2 = do.call(cbind, out2) %>% data.table %>% cbind(dat$ID, .) %>%
setnames(c("ID", paste0("stroke_", 1:length(diseases)))) %>% tbl_dt(FALSE)
proc.time() - st
# user system elapsed
# 3.20 0.52 3.71
all.equal(out2, a) # TRUE## the fastest way
關於multiple search部分,我稍微說明一下
發現summarise_裡面lazyeval部分的environment設定在data內部
可能是避免一些問題才這樣設計
所以沒辦法直接用.dots把global variables塞進去跑
只好先手動mutate進去之後
再一次summarise
--
※ 發信站: 批踢踢實業坊(ptt.cc), 來自: 123.205.27.107
※ 文章網址: https://www.ptt.cc/bbs/R_Language/M.1437373001.A.DE6.html
→
07/20 22:32, , 1F
07/20 22:32, 1F
→
07/20 22:33, , 2F
07/20 22:33, 2F
→
07/20 22:35, , 3F
07/20 22:35, 3F
→
07/20 23:07, , 4F
07/20 23:07, 4F
推
07/21 09:55, , 5F
07/21 09:55, 5F
※ 編輯: celestialgod (123.205.27.107), 07/22/2015 14:46:55
討論串 (同標題文章)
R_Language 近期熱門文章
PTT數位生活區 即時熱門文章