Optimize array-from-ctypes in basic.py (#3927)
Approximately 80% of the runtime when loading "low column count, high row count" DataFrames into Datasets is consumed in `np.fromiter`, called as part of the `Dataset.get_field` method. This is a particularly pernicious hotspot: unlike other ctypes-based methods, it is a hot loop over a Python iterator, and it causes significant GIL contention in multi-threaded applications. Replace `np.fromiter` with a direct call to `np.ctypeslib.as_array`, which allows a single-shot `copy` of the underlying array. This reduces the load time of a ~35 million row categorical DataFrame with 1 column from ~5 seconds to ~1 second, and allows multi-threaded execution.
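For illustration, here is a minimal sketch of the two approaches on a standalone ctypes buffer. The helper names `array_via_fromiter` and `array_via_as_array` and the `float32` element type are assumptions made for this example, not the actual code in basic.py.

```python
import ctypes

import numpy as np


def array_via_fromiter(cptr, length):
    # Old approach: NumPy pulls one element at a time from the ctypes
    # pointer through Python's iterator protocol, holding the GIL throughout.
    return np.fromiter(cptr, dtype=np.float32, count=length)


def array_via_as_array(cptr, length):
    # New approach: wrap the C buffer without copying, then make one
    # bulk copy so the result owns its memory.
    return np.ctypeslib.as_array(cptr, shape=(length,)).copy()


# Hypothetical stand-in for a buffer returned by the C library.
n = 1000
buf = (ctypes.c_float * n)(*range(n))
cptr = ctypes.cast(buf, ctypes.POINTER(ctypes.c_float))

slow = array_via_fromiter(cptr, n)
fast = array_via_as_array(cptr, n)
```

Both calls should produce equal arrays. The `.copy()` matters because `as_array` alone returns a view that shares memory with the C buffer, which may be freed or overwritten later.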