The process_many_files tasks are the computational bottleneck. Currently, these tasks are spun up with a variable number of to-be-processed .dat.gz files where the max number is hardcoded via the ideal_nFiles variable (set to 10). If a glob_subdir task completes and has fewer than 10 .dat.gz files, only one task is created that processes all the found files. Below is the method of sharding the to-be-processed files into tasks:
ideal_nFiles = 10
nTasks = int(len(results[1])/ideal_nFiles) + 1
shards = [results[1][i::nTasks] for i in range(nTasks)]
This is a pretty quick and dirty method that limits the number of files per task to be <= 10. This may be sufficiently performant for our needs but a more efficient sharding method should be possible. Ideas:
- Consider the .dat.gz file sizes. Some files can be 10s of GBs in their gzip'd form while others are only MBs. Its possible to optimize the workflow by considering file size, something like summing over file sizes until a size threshold is hit, determining the set of to-be-processed files for each shard. This may still result in an uneven distribution of to-be-processed across the tasks.
- Currently, the list of to-be-processed files are limited to those output from each individual
glob_subdirs tasks. If, instead, the list of all files were continually aggregated, naive of parent subdir, then much more uniform shards could be assigned as tasks. This might take a bit of work to implement though since the current code expects the same parent subdir for all files sent to the process_many_files task.
The
process_many_filestasks are the computational bottleneck. Currently, these tasks are spun up with a variable number of to-be-processed .dat.gz files where the max number is hardcoded via theideal_nFilesvariable (set to 10). If aglob_subdirtask completes and has fewer than 10 .dat.gz files, only one task is created that processes all the found files. Below is the method of sharding the to-be-processed files into tasks:This is a pretty quick and dirty method that limits the number of files per task to be <= 10. This may be sufficiently performant for our needs but a more efficient sharding method should be possible. Ideas:
glob_subdirstasks. If, instead, the list of all files were continually aggregated, naive of parent subdir, then much more uniform shards could be assigned as tasks. This might take a bit of work to implement though since the current code expects the same parent subdir for all files sent to theprocess_many_filestask.