[Discussion] Deep learning - the "why" question.
https://blog.piekniewski.info/2018/10/13/deep-learning-the-why-question/
There are many, many deep learning models out there doing various things. Depending on the exact task they are solving, they may be constructed differently. Some will use convolution followed by pooling. Some will use several convolutional layers before there is any pooling layer. Some will use max-pooling. Some will use mean-pooling. Some will have dropout added. Some will have a batch-norm layer here and there. Some will use sigmoid neurons, some will use half-rectifiers. Some will classify and therefore optimize for cross-entropy. Others will minimize mean-squared error. Some will use unpooling layers. Some will use deconvolutional layers. Some will use stochastic gradient descent with momentum. Some will use Adam. Some will have ResNet layers, some will use Inception. The choices are plentiful (see e.g. here).
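To make that design space concrete, here is a minimal sketch of my own (assuming PyTorch; the layer sizes and values are arbitrary, not taken from the post) where each commented line marks one of the choice points listed above:

    import torch
    import torch.nn as nn

    model = nn.Sequential(
        nn.Conv2d(3, 32, kernel_size=3, padding=1),  # or 5x5, or several stacked convs before any pooling
        nn.BatchNorm2d(32),                          # or no batch-norm at all
        nn.ReLU(),                                   # half-rectifier; or nn.Sigmoid(), nn.ELU(), ...
        nn.MaxPool2d(2),                             # or nn.AvgPool2d(2) for mean-pooling
        nn.Conv2d(32, 64, kernel_size=3, padding=1),
        nn.ReLU(),
        nn.Dropout(0.5),                             # or no dropout
        nn.AdaptiveAvgPool2d(1),
        nn.Flatten(),
        nn.Linear(64, 10),
    )

    criterion = nn.CrossEntropyLoss()                # classification; or nn.MSELoss() for mean-squared error
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)
    # or: optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

None of these particular choices is forced by the task; each line is exactly the kind of decision the post is talking about.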
Reading any of these particular papers, one is faced with a set of choices the authors had made, followed by the evaluation on the dataset of their choice. The discussion of choices typically refers strongly to the papers where the given techniques were first introduced, whereas the results section typically discusses in detail the previous state of the art. The shape of the architecture is often broken down into obvious and non-obvious decisions. The obvious ones are dictated by the particular task the authors are trying to solve (e.g. when they have an autoencoding-like task, they obviously use a form of an autoencoder).
The non-obvious choices would include questions similar to these: Why did they use a 3x3 conv followed by a 1x1 conv and only then pooling? Why did they replace only the 3 middle layers with MobileNet layers (a ridiculous name, BTW)? Why did they slap batch-norm only on the middle two layers and not all of them? Why did they use max-pooling in the first two layers and no pooling whatsoever in the following three?
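As an illustration only (my own sketch, with made-up sizes), an architecture that raises exactly those questions might look like this in the same PyTorch notation:

    import torch.nn as nn

    oddly_specific = nn.Sequential(
        # first two blocks: max-pooling, no batch-norm -- why here?
        nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
        nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
        # middle blocks: 3x3 conv followed by 1x1 conv, batch-norm only here -- why?
        nn.Conv2d(64, 64, 3, padding=1), nn.BatchNorm2d(64), nn.ReLU(),
        nn.Conv2d(64, 64, 1), nn.BatchNorm2d(64), nn.ReLU(),
        # last block: no pooling whatsoever -- also left unexplained
        nn.Conv2d(64, 128, 3, padding=1), nn.ReLU(),
    )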
Obvious stuff is not discussed because it is obvious; the non-obvious stuff is not discussed because ... let me get back to that in a moment.
In my opinion, discussing these questions is what separates a paper that is at least shallowly scientific from complete charlatanry, even if the charlatanry appears to improve the results on the given dataset.
The sad truth, which few even talk about, is that in the vast majority of cases the answers to the why questions are purely empirical: they tried a bunch of models and these worked best. It is called "hyperparameter tuning" (or meta-parameter tuning). What does that tell us? A few things: first, the authors are completely ignoring the danger of multiple hypothesis testing and generally piss on any statistical foundations of their "research". Second, they probably have more GPUs accessible than they know what to do with (very often the case in big companies these days). Third, they just want to stamp their names on some new record-breaking benchmark, which obviously will be broken two weeks later by somebody who takes their model and does some extra blind tweaking, utilizing even more GPU power.
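A toy illustration of the multiple-hypothesis-testing point (my own sketch, not from the post): score enough configurations on a finite validation set and the "best" one looks like progress even when every configuration is identical.

    import random

    random.seed(0)
    n_configs = 200       # hyperparameter settings "tried"
    val_size = 1000       # validation examples
    true_accuracy = 0.80  # every configuration is secretly the same model

    def noisy_val_accuracy():
        # accuracy measured on a finite sample fluctuates around the truth
        return sum(random.random() < true_accuracy for _ in range(val_size)) / val_size

    scores = [noisy_val_accuracy() for _ in range(n_configs)]
    print(f"true accuracy: {true_accuracy:.3f}")
    print(f"best of {n_configs} tuned models: {max(scores):.3f}")  # a few points higher, purely by chance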
This is not science. This has more to do with people who build beefy PCs and submit their 3DMark results to hold a record for a few days. It is a craft, no doubt, but it is not science. The PC-builders make no pretense that this is science. The deep learning people do. They write what appear to be research papers, just to describe their GPU rig and the result of their random meta-parameter search, with perhaps some tiny shreds of real scientific discussion. Benchmark results provide a nice cover to claim that the paper is in some way "novel" and interesting, but the truth of the matter is, they just overfitted that dataset some more. They might just as well memorize the entire dataset in their model and achieve 100% accuracy, who cares? (Read my AI winter addendum post for some interesting literature on the subject.)
Similarly to the difference between chemistry and alchemy, the scientific discussion is about building a concept, a theory that will enable one to make accurate predictions. Something to guide one's experimental actions. Science does not need to make gold out of lead every time, or, in the case of machine learning, a real scientific paper in this field does not need to beat some current benchmark. A scientific paper does not even need to answer any questions, if it happens to ask some good ones.
Now, obviously, there are exceptions: a small fraction of papers have interesting stuff in them. These are mostly the ones which try to show the deficits of deep learning and engage in a discussion as to why that might be the case.
So next time you read a deep learning paper, try to contemplate these quiet and never-explained choices the authors have made. You'll be shocked to see how many of those are hidden between the lines.
----------------------------
This article really hits home for me.
Even after taking so many courses, implementing dozens of models,
competing in quite a few Kaggle competitions,
and even being about to publish a paper,
I still have a deep case of imposter syndrome.
Seriously, why is this kernel 3 and not 2? Why do the results simply look better with the filters set to 64 rather than 128?
Apart from data preprocessing, most of my working time
goes into hyperparameter tuning, to the point where I question my life choices.
I wonder whether I am really doing academic research, or just results-driven data alchemy that exhaustively enumerates every possibility.
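The kind of enumeration I mean, as a self-contained toy (build_and_evaluate is a hypothetical stand-in for a full train-and-validate run, not real code from any project):

    import random
    from itertools import product

    random.seed(0)

    def build_and_evaluate(kernel_size, filters):
        # hypothetical stand-in for training and validating a real model;
        # returns a noisy score so this sketch runs on its own
        return 0.8 + random.gauss(0, 0.01)

    results = {(k, f): build_and_evaluate(k, f)
               for k, f in product([2, 3, 5], [32, 64, 128])}

    best_k, best_f = max(results, key=results.get)
    print(f"best config: kernel={best_k}, filters={best_f}")  # chosen by score alone, no "why"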
Last year a Korean team won the NTIRE challenge just by removing a few BN layers.
Beyond everyone's favorite hindsight explanation that BN was never suitable for super resolution anyway,
is there really any foundation behind this that can withstand rigorous scrutiny?
Does this kind of self-doubt come from my own excessive insecurity, or is it actually not that rare?
-----
Sent from JPTT on my Asus ASUS_Z01KDA.
--
※ Posted from: PTT (ptt.cc), from: 61.231.154.103
※ Article URL: https://www.ptt.cc/bbs/DataScience/M.1539515084.A.E47.html