Skip to content

Optimize distribution of worker tasks. #6

@rbdavid

Description

@rbdavid

The process_many_files tasks are the computational bottleneck. Currently, these tasks are spun up with a variable number of to-be-processed .dat.gz files where the max number is hardcoded via the ideal_nFiles variable (set to 10). If a glob_subdir task completes and has fewer than 10 .dat.gz files, only one task is created that processes all the found files. Below is the method of sharding the to-be-processed files into tasks:

ideal_nFiles = 10
nTasks = int(len(results[1])/ideal_nFiles) + 1
shards = [results[1][i::nTasks] for i in range(nTasks)]

This is a pretty quick and dirty method that limits the number of files per task to be <= 10. This may be sufficiently performant for our needs but a more efficient sharding method should be possible. Ideas:

  • Consider the .dat.gz file sizes. Some files can be 10s of GBs in their gzip'd form while others are only MBs. Its possible to optimize the workflow by considering file size, something like summing over file sizes until a size threshold is hit, determining the set of to-be-processed files for each shard. This may still result in an uneven distribution of to-be-processed across the tasks.
  • Currently, the list of to-be-processed files are limited to those output from each individual glob_subdirs tasks. If, instead, the list of all files were continually aggregated, naive of parent subdir, then much more uniform shards could be assigned as tasks. This might take a bit of work to implement though since the current code expects the same parent subdir for all files sent to the process_many_files task.

Metadata

Metadata

Assignees

Labels

No labels
No labels

Type

No type
No fields configured for issues without a type.

Projects

No projects

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions