[Discussion] How to design a statistically meaningful check for overfit and und... (post deleted)

Board: DataScience
Author: filialpiety (filialpiety)
Time: 2021/12/24 17:40


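OS: win10, linux
Question category: ML/DL pipeline design
Tools: Python

Question:
So far I have only followed the pipeline from a single paper: split the data into a training set and a testing set, repeat the resampling and hyperparameter tuning many times, and record each algorithm's performance on the testing set. Each algorithm's performance then has a mean and a std, so I can conclude which algorithm performs significantly best.

Now I want to go one step further and determine whether an algorithm's predictions overfit or underfit. Here is my idea:

I want to split a validation set out of the training set and run n rounds of cross validation with hyperparameter tuning, which gives n performance values on the validation set. The pipeline then runs m rounds of resampling and hyperparameter tuning, and finally outputs m testing set performance values and n*m validation set performance values.

When comparing testing_performance (std) against validation_performance (std), assuming both are above random guessing:

1) If validation_performance and testing_performance show no significant difference, can I conclude that the algorithm neither overfits nor underfits?
2) If validation_performance is significantly higher than testing_performance, can I conclude that the algorithm underfits?
3) If validation_performance is significantly lower than testing_performance, can I conclude that the algorithm overfits?

Or does anyone have other ideas? Are there any papers that offer statistically grounded methods and discussion for underfit and overfit? Many thanks.

--
※ Origin: 批踢踢實業坊 (ptt.cc), From: 140.116.56.96 (Taiwan)
※ Article URL: https://www.ptt.cc/bbs/DataScience/M.1640338852.A.3FD.html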
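For anyone who wants to try the comparison described in the post, here is a minimal sketch of one way to test whether validation and testing performance differ. It is not the poster's method or the method of any paper. It assumes the pipeline has already produced m testing scores and m groups of n validation scores. The function name `paired_sign_flip_test` and all numbers are hypothetical. It averages the n validation scores within each repeat, pairs that average with the repeat's testing score, and runs a sign-flip permutation test on the m paired differences. Scores from overlapping resamples are not independent, so the p-value should be read as approximate.

```python
import random
import statistics


def paired_sign_flip_test(val_scores, test_scores, n_permutations=10000, seed=0):
    """Two-sided sign-flip permutation test on paired differences.

    val_scores:  list of m lists, each holding the n validation scores of one repeat
    test_scores: list of m testing scores, one per repeat
    Returns (mean difference validation - testing, approximate p-value).
    """
    if len(val_scores) != len(test_scores):
        raise ValueError("need one group of validation scores per testing score")

    # One paired difference per repeat: mean validation score minus testing score.
    diffs = [statistics.mean(v) - t for v, t in zip(val_scores, test_scores)]
    observed = statistics.mean(diffs)

    # Under the null hypothesis (no systematic gap) each difference is equally
    # likely to be positive or negative, so we randomly flip the signs.
    rng = random.Random(seed)
    extreme = 0
    for _ in range(n_permutations):
        flipped = [d if rng.random() < 0.5 else -d for d in diffs]
        if abs(statistics.mean(flipped)) >= abs(observed):
            extreme += 1

    # The +1 correction keeps the estimated p-value strictly above zero.
    p_value = (extreme + 1) / (n_permutations + 1)
    return observed, p_value


# Hypothetical scores for m = 6 repeats with n = 3 validation folds each.
val_scores = [
    [0.91, 0.89, 0.90], [0.88, 0.90, 0.92], [0.90, 0.91, 0.89],
    [0.92, 0.90, 0.91], [0.89, 0.88, 0.90], [0.91, 0.92, 0.90],
]
test_scores = [0.86, 0.88, 0.85, 0.87, 0.86, 0.84]

gap, p = paired_sign_flip_test(val_scores, test_scores)
print(f"mean(validation - testing) = {gap:.4f}, p ~ {p:.4f}")
```

Note that with m repeats the smallest p-value this test can produce is about 2 / 2^m, so a small m (for example 5) can never reach the 0.05 level. Also, a significant gap only says that the two estimates differ. Whether to call that overfit or underfit is a separate interpretation, which usually also needs the training set performance.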