Re: [Question] How to speed up conditional checks on a dataframe

Board: Python  Author: celestialgod (天)  Time: 2023/05/15 01:47


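※ Quoting 《liquidbox (樹枝擺擺)》:
: I have a list called kw_list made up of nearly ten thousand unique strings, and a df.
: An example list is ['book','money','future','file']
: Index  sentence
: 1      This is a book
: 2      back to the future
: 3      replace the file
: 4      come on
: 5      have a nice weekend
: I want to take the strings in the list one by one and compare them
: against the sentence column. If the sentence contains the string
: (all of the nearly ten thousand strings have to be checked),
: mark it True, otherwise False.
: I created a new dataframe with nearly ten thousand columns, named after kw_list,
: then merged it with the original df,
: then wrote a conditional: if a row's sentence contains the string,
: that column is marked True, otherwise False.
: So it becomes:
: Index  sentence             book   money  future  file
: 1      This is a book       TRUE   FALSE  FALSE   FALSE
: 2      back to the future   FALSE  FALSE  TRUE    FALSE
: 3      replace the file     FALSE  FALSE  FALSE   TRUE
: 4      come on              FALSE  FALSE  FALSE   FALSE
: 5      have a nice weekend  FALSE  FALSE  FALSE   FALSE
: Unsurprisingly, when I do the check in a loop, it runs for hours without finishing:
: for kw in kw_list:
:     df.loc[df['sentence'].str.contains(kw),df[kw]]=True
: I feel that even dumping the same data into Excel and using formulas might be faster.
: Is there a way to rewrite this so the df computation runs faster?

A few people discussed this with me a bit, so here is a comparison of several approaches.

Test data: 300,000 random sentences, with 2,389 keywords drawn at random.

The results for each method are as follows:

1. polars
   1.21 s ± 59.3 ms per loop (mean ± std. dev. of 5 runs, 3 loops each)

2. pandas
   a. Pre-allocate columns first and set values: 6min 36s
   b. for loop add columns => PerformanceWarning: 6min 16s
   c. Pre-allocate columns + np.where + pandas .at: 7min 59s

3. duckdb
   24.4 s ± 177 ms per loop (mean ± std. dev. of 2 runs, 3 loops each)

4. numpy
   a. pre-allocate + for-loop: 4min 23s
   b. pre-allocate + np.char: > 6 minutes

5. Cython
   1.73 s ± 14.7 ms per loop (mean ± std. dev. of 3 runs, 5 loops each)

As you can see, polars is simply not in the same order of magnitude as methods 2/3/4.
polars' with_columns does multi-threading in its Rust backend,
while the other three are single-threaded, so there is really no comparison.

Cython gets close to polars' performance but still loses by a little,
and its result is an np.array, not a dataframe.

If you give up on DataFrame operations, Cython+numpy comes close to matching polars' speed,
but it is hard to write and hard to tune.
It took me more than three hours to get a version I was satisfied with.

Test code: https://reurl.cc/9VMOkO

Machine: AMD TR-2990WX@3.6GHz with Python 3.10.9 on Windows 11
polars version: 0.17.13
pandas version: 1.5.3
duckdb version: 0.7.1

--
※ Origin: 批踢踢實業坊 (ptt.cc), From: 125.229.239.131 (Taiwan)
※ Article URL: https://www.ptt.cc/bbs/Python/M.1684086439.A.CF1.html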
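For reference, the keyword flag table described in the quoted question can be built in pandas by computing every keyword column first and attaching them in a single step. The following is only a minimal sketch on the five example sentences, not the benchmark code linked in the post. The helper name `flag_keywords` is hypothetical, and `regex=False` assumes the keywords are plain substrings rather than regular expressions. Going by the timings reported in the post, pandas-based variants still took minutes at that scale, so this shows the shape of the computation rather than a way to match polars.

```python
import pandas as pd


def flag_keywords(df, kw_list, text_col="sentence"):
    """Return df plus one boolean column per keyword.

    A cell is True when the text in `text_col` contains that keyword.
    (`flag_keywords` is a hypothetical helper name used for illustration.)
    """
    flags = {
        # regex=False: treat each keyword as a literal substring (assumption)
        kw: df[text_col].str.contains(kw, regex=False)
        for kw in kw_list
    }
    # Build all keyword columns at once and concatenate a single time,
    # instead of inserting columns into df one by one inside a loop.
    return pd.concat([df, pd.DataFrame(flags, index=df.index)], axis=1)


# Example data taken from the quoted question
df = pd.DataFrame(
    {
        "sentence": [
            "This is a book",
            "back to the future",
            "replace the file",
            "come on",
            "have a nice weekend",
        ]
    },
    index=pd.Index([1, 2, 3, 4, 5], name="Index"),
)
kw_list = ["book", "money", "future", "file"]

result = flag_keywords(df, kw_list)
print(result)
```

Collecting the columns in a dict and calling `pd.concat` once is a design choice that sidesteps the fragmentation `PerformanceWarning` mentioned for method 2b, since no column is inserted into an existing frame inside the loop.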