    Optimize array-from-ctypes in basic.py (#3927) · de8c6105
    Alex Ford authored
    Approximately 80% of runtime when loading "low column count, high row
    count" DataFrames into Datasets is consumed in `np.fromiter`, called
    as part of the `Dataset.get_field` method.
    
    This is a particularly pernicious hotspot: unlike other ctypes-based
    methods, it is a hot loop over a Python iterator and causes
    significant GIL contention in multi-threaded applications.
    
    Replace `np.fromiter` with a direct call to `np.ctypeslib.as_array`,
    which allows a single-shot `copy` of the underlying array.
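
A minimal sketch of the two approaches, for illustration only. It is not the code from `basic.py`: the helper names `copy_with_fromiter` and `copy_with_as_array` and the small `c_float` buffer are assumptions made for this example. The first helper pulls every element through a Python-level generator; the second wraps the pointer as a NumPy view and copies it in one call.

```python
import ctypes

import numpy as np


def copy_with_fromiter(ptr, length, dtype):
    # Hypothetical helper: element-by-element copy driven by a Python
    # generator, so every element is fetched through the interpreter.
    return np.fromiter((ptr[i] for i in range(length)), dtype=dtype, count=length)


def copy_with_as_array(ptr, length):
    # Hypothetical helper: wrap the ctypes pointer as a NumPy view of the
    # same memory (a shape is required for pointers), then copy it once.
    return np.ctypeslib.as_array(ptr, shape=(length,)).copy()


# Stand-in for a buffer owned by a C library (assumed data for the example).
n = 5
buf = (ctypes.c_float * n)(1.0, 2.0, 3.0, 4.0, 5.0)
ptr = ctypes.cast(buf, ctypes.POINTER(ctypes.c_float))

slow = copy_with_fromiter(ptr, n, np.float32)
fast = copy_with_as_array(ptr, n)
```

Both helpers return an array that owns its data, so the result stays valid after the underlying C buffer is modified or freed.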
    
    This reduces the load time of a ~35 million row categorical dataframe
    with 1 column from ~5 seconds to ~1 second, and allows multi-threaded
    execution.
basic.py 144 KB