Horovod Runner is stuck. Not passing through the first epoch after start training. #250

@camposwalacy

Hello, folks!

I am using HorovodRunner through sparkdl on Databricks Runtime LTS 14.2 ML with TensorFlow 14.0. My data is in TFRecords format. This issue started happening after 25th June. I migrated my workload to Unity Catalog. I am debugging on my side to see whether anything might have changed, but I haven't found a fix yet.
