Question: Best practices to limit CPU usage? #437

@jashapiro

I have started running CytoTable on some data from the Cell Painting Gallery, wrapped as part of a Nextflow workflow. One thing I would like to do is limit the CPU usage of cytotable.convert() while maintaining its efficiency, and I am not clear on the best way to do that.

I can limit the number of parsl processes/threads with its config, but duckdb and pyarrow each seem to have their own CPU discovery that defaults to the full set of available cores, so with that setting alone I still end up with quite a large number of threads. It seems like I should also be able to set CYTOTABLE_MAX_THREADS, and that would limit duckdb, but perhaps not pyarrow? I think pyarrow can be limited on its own with set_cpu_count() and set_io_thread_count().
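To make the knobs mentioned above concrete, here is a minimal sketch of applying one cap in a single place before calling cytotable.convert(). The helper name `cap_library_threads` is hypothetical. It assumes, as the question does, that CytoTable reads CYTOTABLE_MAX_THREADS to limit duckdb; that behavior is not confirmed here. The pyarrow calls (`set_cpu_count`, `set_io_thread_count`) are real pyarrow functions and are only applied if pyarrow is importable.

```python
import os


def cap_library_threads(n_threads: int) -> dict:
    """Apply one thread cap to the libraries discussed above (hypothetical helper).

    Returns a dict describing which settings were applied.
    """
    if n_threads < 1:
        raise ValueError("n_threads must be at least 1")

    applied = {}

    # Assumption from the question: CytoTable consults this environment
    # variable when configuring duckdb. Set it before calling convert().
    os.environ["CYTOTABLE_MAX_THREADS"] = str(n_threads)
    applied["CYTOTABLE_MAX_THREADS"] = n_threads

    # pyarrow keeps process-wide CPU and I/O thread pools; both can be resized.
    try:
        import pyarrow as pa
    except ImportError:
        pa = None  # pyarrow not installed; skip its settings

    if pa is not None:
        pa.set_cpu_count(n_threads)
        pa.set_io_thread_count(n_threads)
        applied["pyarrow_cpu_count"] = pa.cpu_count()
        applied["pyarrow_io_thread_count"] = pa.io_thread_count()

    return applied


if __name__ == "__main__":
    print(cap_library_threads(2))
```

Because the pyarrow pools are process-wide, this call would need to run in every worker process if a multi-process executor is used.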

So there seem to be a number of possible places to limit CPU use, and what I am trying to work out is how best to balance them. Do you have any recommendations for balancing the CPU allocation among parsl, duckdb, and pyarrow?
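One simple way to frame the balance, offered as a sketch rather than a maintainer recommendation: treat the container's CPU allocation as a fixed budget, and divide it so that the number of parsl workers multiplied by the threads each worker's libraries may use does not exceed it. The function name `split_cpu_budget` and the even-split heuristic are assumptions for illustration only.

```python
def split_cpu_budget(total_cpus: int, parsl_workers: int) -> dict:
    """Split a CPU budget between parsl workers and per-worker library threads.

    Heuristic (an assumption, not CytoTable guidance): give each worker an
    equal share of the budget, with at least one thread per worker.
    """
    if total_cpus < 1 or parsl_workers < 1:
        raise ValueError("total_cpus and parsl_workers must be at least 1")

    # Never run more workers than there are CPUs in the budget.
    workers = min(parsl_workers, total_cpus)
    # Integer division keeps workers * threads_per_worker within the budget.
    threads_per_worker = max(1, total_cpus // workers)

    return {
        "parsl_workers": workers,
        "threads_per_worker": threads_per_worker,
        "worst_case_threads": workers * threads_per_worker,
    }


if __name__ == "__main__":
    print(split_cpu_budget(total_cpus=8, parsl_workers=4))
```

With 8 CPUs and 4 workers this gives 2 threads per worker. Whether fewer workers with more library threads each, or the reverse, is faster for cytotable.convert() is exactly the open question here.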

Secondarily, for a system running within a single Docker container for import, do you have a recommendation between HighThroughputExecutor and ThreadPoolExecutor? It seems from the paper that multithreading is generally a bit more efficient, but I wanted to confirm that was what you would recommend.

Thanks for your help!
