9 add toc code #14
…will be used in multiple workflows
closes #9

I should note that the current code should never have
To avoid confusion with other "mapping" files from other databases, I'm going to start using "protein_id index" instead of "id mapping". This phrase mirrors the publishing terminology of "table of contents" (toc) that I've been using for some time now.
…hese regex patterns
This file is associated with issue #10 and should not have been committed in this PR. I think this happened when I renamed the subdirectory; all files within the subdirectory got added to the commit.
This file should not have been committed in this PR. I think this happened when I renamed the subdirectory; all files within the subdirectory got added to the commit.
nilsoberg left a comment
Formatting suggestions, otherwise good.
# finished task is a glob_files task, so results[1] will be
# the list of gzipped files. Break this list down into bite
# sized chunks and submit a new task for each chunk.
#shards = [results[1][i::args.n_workers] for i in range(args.n_workers)]
Is this commented code still needed?
# to be processed across the tasks. If n_workers is > than
# files in results[1], then only the necessary number of tasks
# to process one file per worker are created.
#new_futures = [client.submit(process_many_files, shard, database_params = database_params, db_name = args.db_name, final_output_dir = args.output_dir, temp_output_dir = args.local_scratch) for shard in shards if shard]
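For context on the commented-out lines above: the `shards` expression splits the globbed file list with a strided slice, one shard per worker, and the `if shard` filter skips any shard that comes out empty when there are more workers than files. A minimal sketch of that behavior follows; `make_shards` and the file names are hypothetical stand-ins, not names from this PR.

```python
def make_shards(files, n_workers):
    """Split `files` into at most `n_workers` strided shards.

    Hypothetical helper: it mirrors the commented-out list comprehension,
    then drops empty shards the way the `if shard` filter does.
    """
    shards = [files[i::n_workers] for i in range(n_workers)]
    return [shard for shard in shards if shard]


# Made-up file names, for illustration only.
files = ["a.gz", "b.gz", "c.gz", "d.gz", "e.gz"]

# Two workers: files alternate between the two shards.
two_worker_shards = make_shards(files, 2)

# Eight workers but only five files: five one-file shards, no empty ones.
eight_worker_shards = make_shards(files, 8)
```

With strided slicing, every file lands in exactly one shard and shard sizes differ by at most one file.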
"""
Returns an `<argparse.Namespace>` with attributes associated with the input
arguments for the ENA database build script:
* ``--ena-paths``, a number of path strings; accepts multiple values
I wonder if the docstring formatting in metadata_generate_tasks.py is clearer than this formatting. I.e. write
* ``--output-dir`` or ``-out``, path written within which files will be
written
Either way, the formatting should be consistent throughout the app.
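To make the two docstring entries concrete, here is a rough sketch of how the flags might be declared with `argparse`. Only the flag names `--ena-paths`, `--output-dir` and `-out` come from the docstrings quoted above; the function name, help strings, `required` settings and example paths are assumptions for illustration and may not match the actual scripts.

```python
import argparse


def parse_args(argv=None):
    """Hypothetical parser showing the two documented flags."""
    parser = argparse.ArgumentParser(description="Illustrative parser only.")
    # nargs="+" makes the flag accept one or more path strings.
    parser.add_argument("--ena-paths", nargs="+", required=True,
                        help="one or more path strings")
    # Both spellings map to the same attribute, `output_dir`.
    parser.add_argument("--output-dir", "-out", required=True,
                        help="path within which files will be written")
    return parser.parse_args(argv)


# Made-up paths, for illustration only.
args = parse_args(["--ena-paths", "/data/a", "/data/b", "-out", "/tmp/out"])
```

The returned `argparse.Namespace` exposes the values as `args.ena_paths` (a list) and `args.output_dir` (a string).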
from . import mysql_database
from . import parse_embl
from . import dask_tasks
from . import glob_tasks
Initial push of the TOC generation code.
The workflow reuses the "glob" functions from the ENA build workflow to gather all of the files from the ENA subdirectories. "Shards" of these files are then handed to workers, which gather the metadata for each file. The metadata fields are hardcoded at the moment: file path, file size (bytes), and last modified time (epoch seconds). There is also an option to save the md5 hash of each file.
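As a rough illustration of the per-file metadata described above, the sketch below collects path, size, and last modified time with `os.stat`, plus an optional md5 hash. The function name `file_metadata`, the dictionary keys, and the chunked hashing are assumptions for this example, not the code in this PR.

```python
import hashlib
import os
import tempfile


def file_metadata(path, with_md5=False, chunk_size=1 << 20):
    """Hypothetical helper: gather path, size, mtime, and optionally md5."""
    stat = os.stat(path)
    record = {
        "path": path,
        "size_bytes": stat.st_size,
        "mtime_epoch": int(stat.st_mtime),
    }
    if with_md5:
        digest = hashlib.md5()
        # Read in chunks so large gzipped files are never fully in memory.
        with open(path, "rb") as handle:
            for chunk in iter(lambda: handle.read(chunk_size), b""):
                digest.update(chunk)
        record["md5"] = digest.hexdigest()
    return record


# Usage on a throwaway file containing the five bytes b"hello".
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"hello")
example = file_metadata(tmp.name, with_md5=True)
os.remove(tmp.name)
```

Keeping the hash optional matters because hashing reads every byte of every file, while the other three fields come from a single `stat` call.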
The original dask_tasks.py was also cleaned up along the way. These tasks and the imports need more testing to ensure that both the old workflow and the new TOC generation workflow work as intended.