9 add toc code #14

Open
rbdavid wants to merge 15 commits into main from 9-add-TOC-code

Conversation

@rbdavid
Contributor

@rbdavid rbdavid commented Oct 1, 2025

Initial push of the TOC generation code.

The workflow code uses the same "glob" functions as the ENA build workflow to gather all of the files from the ENA subdirectories. "Shards" of these files are then handed to workers, which gather the metadata associated with each file. The metadata fields currently hardcoded are file path, file size (bytes), and last-modified time (epoch seconds). There is also an option to save the MD5 hash of each file.

I also cleaned up the original dask_tasks.py. These tasks and their imports need more testing to ensure that both the old workflow and the new TOC generation workflow work as intended.

@rbdavid rbdavid self-assigned this Oct 1, 2025
@rbdavid
Contributor Author

rbdavid commented Oct 1, 2025

closes #9

@rbdavid rbdavid added the enhancement New feature or request label Oct 1, 2025
@rbdavid rbdavid linked an issue Oct 1, 2025 that may be closed by this pull request
@rbdavid rbdavid requested a review from nilsoberg October 1, 2025 15:35
@rbdavid
Contributor Author

rbdavid commented Oct 1, 2025

I should note that the current code should never have args.id_mapping == True, because the code it would call (mapping.process_file()) has not been written yet. I have added the hooks for that functionality to this workflow as a design guide; in the near future, toc_generation.py can be used to generate the TOC, the protein ID mapping table, or both.
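One way to make that constraint explicit is to reject the option at argument-parsing time until the implementation exists. The sketch below is only an assumption about how that could look: the flag names `--toc` and `--id-mapping` and the function `parse_args` are hypothetical and not taken from toc_generation.py.

```python
import argparse


def parse_args(argv=None):
    """Parse hypothetical TOC workflow flags and refuse unimplemented ones."""
    parser = argparse.ArgumentParser(description="TOC generation (sketch)")
    parser.add_argument("--toc", action="store_true",
                        help="generate the table of contents")
    parser.add_argument("--id-mapping", action="store_true",
                        help="generate the protein ID mapping table")
    args = parser.parse_args(argv)
    if args.id_mapping:
        # parser.error() prints a usage message and exits with status 2.
        parser.error("--id-mapping is not implemented yet")
    return args
```

Failing fast here keeps the design-guide hooks in place while ensuring no task that depends on the missing code is ever submitted.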

@rbdavid
Contributor Author

rbdavid commented Oct 1, 2025

To avoid confusion with "mapping" files from other databases, I'm going to start using "protein_id index" instead of "id mapping". This phrase mirrors the publishing term "table of contents" (TOC) that I've been using for some time now.

@rbdavid rbdavid marked this pull request as ready for review October 1, 2025 19:52

@nilsoberg nilsoberg left a comment

Formatting suggestions, otherwise good.

# finished task is a glob_files task, so results[1] will be
# the list of gzipped files. Break this list down into bite
# sized chunks and submit a new task for each chunk.
#shards = [results[1][i::args.n_workers] for i in range(args.n_workers)]

Is this commented code still needed?

# to be processed across the tasks. If n_workers is > than
# files in results[1], then only the necessary number of tasks
# to process one file per worker are created.
#new_futures = [client.submit(process_many_files, shard, database_params = database_params, db_name = args.db_name, final_output_dir = args.output_dir, temp_output_dir = args.local_scratch) for shard in shards if shard]

Code still needed?
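For reference while deciding, the two commented-out lines above express a strided split followed by a filter on empty shards. A minimal sketch of that pattern follows; the function name `make_shards` is hypothetical, and the Dask `client.submit` call is deliberately left out so the example runs on its own.

```python
def make_shards(files, n_workers):
    """Split files into at most n_workers strided shards, dropping empty ones.

    Shard i takes elements i, i + n_workers, i + 2 * n_workers, and so on.
    When n_workers exceeds len(files), the surplus shards are empty and are
    filtered out, so each remaining shard holds exactly one file.
    """
    if n_workers < 1:
        raise ValueError("n_workers must be at least 1")
    candidates = (files[i::n_workers] for i in range(n_workers))
    return [shard for shard in candidates if shard]


# Five files over two workers: shards of sizes 3 and 2.
print(make_shards(["a", "b", "c", "d", "e"], 2))
# Two files over four workers: only two single-file shards survive.
print(make_shards(["a", "b"], 4))
```

Each surviving shard would then become one submitted task, which matches the comment that only the necessary number of tasks are created.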

"""
Returns an `<argparse.Namespace>` with attributes associated with the input
arguments for the ENA database build script:
* ``--ena-paths``, a number of path strings; accepts multiple values

I wonder if the docstring formatting in metadata_generate_tasks.py is clearer than this formatting. I.e. write

* ``--output-dir`` or ``-out``, path written within which files will be
   written

Either way, the formatting should be consistent throughout the app.

Comment thread ena_build/__init__.py
from . import mysql_database
from . import parse_embl
from . import dask_tasks
from . import glob_tasks

Order of imports

Development

Successfully merging this pull request may close these issues.

Add table of content generating code

2 participants