Optimize distribution of worker tasks.

The `process_many_files` tasks are the computational bottleneck. Currently, these tasks are spun up with a variable number of to-be-processed .dat.gz files where the max number is hardcoded via the `ideal_nFiles` variable (set to 10). If a `glob_subdir` task completes and has fewer than 10 .dat.gz files, only one task is created that processes all the found files. Below is the method of sharding the to-be-processed files into tasks: 

```
ideal_nFiles = 10
nTasks = int(len(results[1])/ideal_nFiles) + 1
shards = [results[1][i::nTasks] for i in range(nTasks)]
```

This is a pretty quick and dirty method that limits the number of files per task to be <= 10. This may be sufficiently performant for our needs but a more efficient sharding method should be possible. Ideas:

- Consider the .dat.gz file sizes. Some files can be 10s of GBs in their gzip'd form while others are only MBs. Its possible to optimize the workflow by considering file size, something like summing over file sizes until a size threshold is hit, determining the set of to-be-processed files for each shard. This may still result in an uneven distribution of to-be-processed across the tasks.  
- Currently, the list of to-be-processed files are limited to those output from each individual `glob_subdirs` tasks. If, instead, the list of all files were continually aggregated, naive of parent subdir, then much more uniform shards could be assigned as tasks. This might take a bit of work to implement though since the current code expects the same parent subdir for all files sent to the `process_many_files` task.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Optimize distribution of worker tasks. #6

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Optimize distribution of worker tasks. #6

Description

Metadata

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Issue actions