9 add toc code by rbdavid · Pull Request #14 · EnzymeFunctionInitiative/ENA_Database_Build

rbdavid · 2025-10-01T05:33:59Z

Initial push of the TOC generation code.

The workflow reuses the "glob" functions from the ENA build workflow to gather all of the files from the ENA subdirectories. "Shards" of these files are then handed to workers, which gather the metadata associated with each file. The metadata fields are hardcoded at the moment: file path, file size (bytes), and last-modified time (epoch seconds). There is also an option to save the md5 hash of each file.
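For readers who have not opened the diff, here is a minimal sketch of the per-file metadata gathering described above. It is not the code in this PR: the function name `gather_file_metadata` and its parameters are hypothetical, and only the standard library is used.

```python
import hashlib
import os


def gather_file_metadata(path, with_md5=False, chunk_size=1 << 20):
    """Return (path, size in bytes, mtime in epoch seconds[, md5 hex digest]).

    Hypothetical helper for illustration; the names are not taken from the PR.
    """
    stat = os.stat(path)
    record = [str(path), stat.st_size, int(stat.st_mtime)]
    if with_md5:
        digest = hashlib.md5()
        # Read in chunks so large gzipped files are never fully held in memory.
        with open(path, "rb") as handle:
            for chunk in iter(lambda: handle.read(chunk_size), b""):
                digest.update(chunk)
        record.append(digest.hexdigest())
    return tuple(record)
```

A worker would call this once per file in its shard and write one row per returned tuple. Hashing is kept optional because it requires reading every byte of every file, while the other three fields come from a single `os.stat` call.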

I also cleaned up the original dask_tasks.py. These tasks and their imports need more testing to ensure that both the old workflow and the new TOC generation workflow work as intended.

rbdavid · 2025-10-01T05:34:13Z

closes #9

rbdavid · 2025-10-01T19:11:34Z

Note that args.id_mapping should never be True in the current code, because the code it would call (mapping.process_file()) has not been written yet. I added the hooks for that functionality to this workflow as a design guide; in the near future, toc_generation.py can be used to generate the TOC, the protein ID mapping table, or both.

rbdavid · 2025-10-01T19:46:19Z

To avoid confusion with "mapping" files from other databases, I'm going to start using "protein_id index" instead of "id mapping". This phrase mirrors the publishing terminology, "table of contents" (TOC), that I've been using for some time now.

nilsoberg

Formatting suggestions, otherwise good.

nilsoberg · 2025-10-14T16:09:29Z

+                # finished task is a glob_files task, so results[1] will be
+                # the list of gzipped files. Break this list down into bite
+                # sized chunks and submit a new task for each chunk.
+                #shards = [results[1][i::args.n_workers] for i in range(args.n_workers)]


Is this commented code still needed?

nilsoberg · 2025-10-14T16:09:41Z

+                # to be processed across the tasks. If n_workers is > than
+                # files in results[1], then only the necessary number of tasks
+                # to process one file per worker are created.
+                #new_futures = [client.submit(process_many_files, shard, database_params = database_params, db_name = args.db_name, final_output_dir = args.output_dir, temp_output_dir = args.local_scratch) for shard in shards if shard]


Code still needed?
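For context on the two commented-out lines quoted above, here is a minimal sketch of the strided sharding they appear to perform, including the `if shard` filter that drops empty shards when there are more workers than files. The function name `make_shards` is hypothetical, and the Dask `client.submit` call is left out so the example runs on its own.

```python
def make_shards(files, n_workers):
    """Split files into at most n_workers interleaved, non-empty shards."""
    if n_workers < 1:
        raise ValueError("n_workers must be at least 1")
    # Shard i takes every n_workers-th file, starting at index i.
    shards = [files[i::n_workers] for i in range(n_workers)]
    # With more workers than files, the trailing shards are empty; drop them
    # so that no task is created without work.
    return [shard for shard in shards if shard]


files = [f"file_{i}.dat.gz" for i in range(5)]
two_workers = make_shards(files, 2)    # 2 shards holding 3 and 2 files
eight_workers = make_shards(files, 8)  # 5 shards holding 1 file each
```

Strided slicing keeps the shard sizes within one file of each other without computing chunk boundaries.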

nilsoberg · 2025-10-14T16:12:44Z

+    """
+    Returns an `<argparse.Namespace>` with attributes associated with the input
+    arguments for the ENA database build script:
+        * ``--ena-paths``, a number of path strings; accepts multiple values


I wonder if the docstring formatting in metadata_generate_tasks.py is clearer than this formatting. I.e. write

* ``--output-dir`` or ``-out``, path to the directory within which files will be written

Either way, the formatting should be consistent throughout the app.
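To make the suggestion concrete, here is a minimal sketch of one docstring convention applied to a small parser. Only the two flags quoted in this thread are shown; the function name `parse_args`, the help strings, and the `required` settings are assumptions rather than the app's actual definitions.

```python
import argparse


def parse_args(argv=None):
    """
    Returns an `<argparse.Namespace>` with attributes associated with the
    input arguments:
        * ``--ena-paths``, a number of path strings; accepts multiple values
        * ``--output-dir`` or ``-out``, path to the directory within which
          files will be written
    """
    parser = argparse.ArgumentParser(description="Illustrative parser only.")
    parser.add_argument(
        "--ena-paths", nargs="+", required=True,
        help="one or more ENA subdirectory paths",
    )
    parser.add_argument(
        "--output-dir", "-out", required=True,
        help="directory within which files will be written",
    )
    return parser.parse_args(argv)
```

Each bullet lists the long flag first, then any short alias, then a one-line description, so the docstring and the `add_argument` calls can be compared entry by entry.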

nilsoberg · 2025-10-14T16:13:50Z

 from . import mysql_database
 from . import parse_embl
 from . import dask_tasks
+from . import glob_tasks


Order of imports

rbdavid added 2 commits September 30, 2025 14:24

move file globbing functions to a separate module file because these will be used in multiple workflows

2101952

initial push of the cleaned TOC generation code

0e68436

rbdavid self-assigned this Oct 1, 2025

rbdavid added the enhancement New feature or request label Oct 1, 2025

rbdavid linked an issue Oct 1, 2025 that may be closed by this pull request

Add table of content generating code #9

Open

rbdavid requested a review from nilsoberg October 1, 2025 15:35

rbdavid added 2 commits October 1, 2025 13:48

improve IO and clean up code

02879f5

add doc strings and clean up

6b3ee6b

remove phrasing of id mapping to instead use protein_id index

f715393

rbdavid marked this pull request as ready for review October 1, 2025 19:52

rbdavid added 10 commits October 1, 2025 15:33

rename to better represent the intent of the code

8fa4f4d

rename to better represent the intent of the code, second pass

7e71205

rename subdir to better represent the code housed within

011fad7

pull logging functions out of the tskmgr codes and import this code

41cac09

update toml file to include the metadata_generation workflow as a cli command

dc25f3e

centralize file path regex patterns to glob_tasks.py; add tests for these regex patterns

1a2493a

remove contents because this was not supposed to be committed

6a83c34

Delete ena_build/GenerateMetadata/mapping.py

bd11c83

This file is associated with issue #10 and should not have been committed in this PR. I think this happened when I renamed the subdirectory; all files within the subdirectory got added to the commit.

Delete ena_build/GenerateMetadata/parse_embl.py

081d05f

This file should not have been committed in this PR. I think this happened when I renamed the subdirectory; all files within the subdirectory got added to the commit.

add regex pattern constant to import line

f9f9130

nilsoberg approved these changes Oct 14, 2025

rbdavid wants to merge 15 commits into main from 9-add-TOC-code
