I have started using CytoTable to process some data from the CellPainting Gallery, and I am wrapping it as part of a Nextflow workflow. I would like to limit the CPU usage of cytotable.convert() while maintaining its efficiency, and I am not clear on the best way to do that.
I can limit the number of parsl processes/threads with its config, but duckdb and pyarrow seem to have their own CPU discovery that defaults to the full set of available cores, so I still end up with quite a large number of threads using that setting alone. It seems like I should also be able to set CYTOTABLE_MAX_THREADS to limit duckdb, but perhaps not pyarrow? I think pyarrow can be limited separately with set_cpu_count() and set_io_thread_count().
So there seem to be a number of possible places to limit CPU use, and what I am trying to work out is how best to balance them. Do you have any recommendations for balancing the CPU allocation among parsl, duckdb, and pyarrow?
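To make the question concrete, here is a minimal sketch of the kind of split I have in mind. The helper split_cpu_budget() and the even division it performs are my own assumptions, not anything from CytoTable. I am also assuming that CYTOTABLE_MAX_THREADS needs to be set before cytotable is imported, and the pyarrow calls are skipped if pyarrow is not installed.

```python
import os


def split_cpu_budget(total_cpus: int, parsl_workers: int) -> dict:
    """Hypothetical helper: divide a fixed CPU budget evenly between
    parsl workers and the threads each worker is allowed to use."""
    if total_cpus < 1 or parsl_workers < 1:
        raise ValueError("total_cpus and parsl_workers must both be >= 1")
    # Never run more workers than there are CPUs in the budget.
    workers = min(parsl_workers, total_cpus)
    # Give each worker an equal share, with a floor of one thread.
    threads_per_worker = max(1, total_cpus // workers)
    return {"parsl_workers": workers, "threads_per_worker": threads_per_worker}


# Example: 8 CPUs allotted to the Nextflow task, 4 parsl workers requested.
budget = split_cpu_budget(total_cpus=8, parsl_workers=4)

# Assumption: set this before importing cytotable so the limit is picked up.
os.environ["CYTOTABLE_MAX_THREADS"] = str(budget["threads_per_worker"])

try:
    import pyarrow as pa

    # pyarrow's CPU and IO thread pools are process-wide settings.
    pa.set_cpu_count(budget["threads_per_worker"])
    pa.set_io_thread_count(budget["threads_per_worker"])
except ImportError:
    # pyarrow is not available in this environment; nothing to limit.
    pass
```

The parsl_workers value would then go into the parsl config passed to CytoTable. Whether an even split like this is sensible, or whether one of the three libraries should get most of the cores, is the part I am unsure about.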
Secondarily, for a system running within a single Docker container for import, do you have a recommendation between HighThroughputExecutor and ThreadPoolExecutor? It seems from the paper that multithreading is generally a bit more efficient, but I wanted to confirm that was what you would recommend.
Thanks for your help!