Frequently Asked Questions

This section answers common questions, organized by Nemo-Run functionality.

Configuration

Q: UnserializableValueError when using run.Partial or run.Config

fiddle._src.experimental.serialization.UnserializableValueError: Unserializable value .tmp of type <class 'pathlib.PosixPath'>. Error occurred at path '<root>.something'.

A: Every nested object inside run.Partial or run.Config needs to be serializable. As a result, if you are trying to configure objects, it's better to wrap them in run.Config. For example, the above error arises when you do the following:

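from pathlib import Path

import nemo_run as run
from nemorun.config import ZlibJSONSerializer

# Fails: a raw pathlib.Path instance is not serializable.
partial = run.Partial(some_function, something=Path("/tmp"))
ZlibJSONSerializer().serialize(partial)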

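To fix it, wrap the path in run.Config:

from pathlib import Path

import nemo_run as run
from nemorun.config import ZlibJSONSerializer

# Works: the Path is described by a run.Config instead of being passed as an instance.
partial = run.Partial(some_function, something=run.Config(Path, "/tmp"))
ZlibJSONSerializer().serialize(partial)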
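
When the offending value sits deep inside a nested configuration, it can help to scan the raw arguments before building the run.Partial. The sketch below is a plain-Python helper and not part of Nemo-Run: find_suspect_values and PLAIN_TYPES are hypothetical names, and the set of "plain" types is a simplifying assumption that is stricter than what the serializer actually accepts. Anything it reports is only a candidate for wrapping in run.Config.

```python
from pathlib import Path
from typing import Any, Iterator

# Leaf types treated as safe in this sketch (an assumption, not Nemo-Run's actual rules).
PLAIN_TYPES = (str, int, float, bool, type(None))


def find_suspect_values(value: Any, path: str = "<root>") -> Iterator[tuple[str, str]]:
    """Yield (path, type name) for every nested leaf that is not a plain value."""
    if isinstance(value, dict):
        for key, item in value.items():
            yield from find_suspect_values(item, f"{path}.{key}")
    elif isinstance(value, (list, tuple)):
        for index, item in enumerate(value):
            yield from find_suspect_values(item, f"{path}[{index}]")
    elif not isinstance(value, PLAIN_TYPES):
        yield path, type(value).__name__


# Keyword arguments you were about to pass to run.Partial (hypothetical values).
kwargs = {"something": Path("/tmp"), "retries": 3, "paths": ["a", Path("b")]}
suspects = list(find_suspect_values(kwargs))
print(suspects)
```

Here the helper reports the entries at <root>.something and <root>.paths[1]. Run it on raw arguments only: values already wrapped in run.Config are not plain types either, so this simple check would flag them too.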

Q: Deserialization error when using run.Partial or run.Config

One example is shown below:

ValueError: Using the Buildable constructor to convert a buildable to a new type or to override arguments is forbidden; please use either `fdl.cast(new_type, buildable)` (for casting) or `fdl.copy_with(buildable, **kwargs)` (for overriding arguments).

A: Ensure that only Config or Partial objects are present in your nested configuration. As a quick sanity check, round-trip the object through the serializer and compare the result with the original:

from pathlib import Path

import nemo_run as run
from nemorun.config import ZlibJSONSerializer

serializer = ZlibJSONSerializer()
partial = run.Partial(some_function, something=run.Config(Path, "/tmp"))
# The round trip should reproduce an equal object.
assert serializer.deserialize(serializer.serialize(partial)) == partial

Q: How to use control flow in autoconvert?

If I use control flow with run.autoconvert, I get the error "UnsupportedLanguageConstructError: Control flow (ListComp) is unsupported by auto_config." For example, the code below doesn't work.

@run.autoconvert
def control_flow() -> llm.PreTrainingDataModule:
    return llm.PreTrainingDataModule(
        paths=[Path(f"some_doc_{i}") for i in range(10)],
        weights=[1 for i in range(10)]
    )

A: As the error says, control flow is not supported inside run.autoconvert. To work around this, drop the decorator and return a run.Config directly from a regular Python function. The example then becomes:

def control_flow_config() -> run.Config[llm.PreTrainingDataModule]:
    return run.Config(
        llm.PreTrainingDataModule,
        paths=[run.Config(Path, f"some_doc_{i}") for i in range(10)],
        weights=[1 for i in range(10)]
    )

Q: I made a change locally in my git repo and tested it using the local executor. However, the change is not reflected in the remote job.

A: This is most likely because you haven't committed the changes. See the GitArchivePackager documentation to learn more.
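
A quick way to catch this before launching is to ask git whether the working tree has anything uncommitted. The sketch below is not part of Nemo-Run: parse_porcelain and uncommitted_paths are hypothetical helper names, and the only external call is the standard git status --porcelain command.

```python
import subprocess


def parse_porcelain(output: str) -> list[str]:
    """Return the path portion of each line of `git status --porcelain` output."""
    # Each line is "XY <path>": two status characters, a space, then the path.
    # Renames appear as "old -> new" and are returned as-is.
    return [line[3:] for line in output.splitlines() if line.strip()]


def uncommitted_paths(repo_dir: str = ".") -> list[str]:
    """List files with modified, staged, or untracked changes in repo_dir."""
    result = subprocess.run(
        ["git", "status", "--porcelain"],
        cwd=repo_dir,
        capture_output=True,
        text=True,
        check=True,
    )
    return parse_porcelain(result.stdout)
```

Calling uncommitted_paths() from the repo root before submitting a remote job returns an empty list when everything is committed. A non-empty list names files whose changes have not been committed yet.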

Q: I made a change locally outside my git repo and tested it using the local executor. However, the change is not reflected in the remote job.

A: Currently, we only package your current repo. To make changes in other repos available to the remote job, check out those repos on the remote cluster and mount them at the correct path in your Docker image. We will add support for packaging multiple repos in the future.

Execution

Q: For SlurmExecutor, how can I execute directly from the login node of the cluster?

A: When you launch from your local machine, the SlurmExecutor reaches the cluster over SSH, so you may have something like:

ssh_tunnel = run.SSHTunnel(
    host="your-slurm-host",
    user="your-user",
    job_dir="/your/home/directory/nemo-run-experiments",
)
executor = run.SlurmExecutor(
    ...
    tunnel=ssh_tunnel,
    ...
)

If you are on the login node of the Slurm cluster, simply change the tunnel as shown below:

executor = run.SlurmExecutor(
    ...
    tunnel=run.LocalTunnel(),
    ...
)

Management

Q: I can't retrieve logs for an experiment.

A: There could be a few reasons for this, described below:

  • The Nemo-Run home has changed. By default, the home is at ~/.nemorun, but you can override it using NEMORUN_HOME. Retrieving logs can be difficult if the home differs between when you launched the experiment and when you try to retrieve its logs.
  • The Nemo-Run home has been deleted or overwritten since you ran the experiment.
  • Logs are not available on the remote cluster. For example, if launching on Kubernetes using the SkypilotExecutor, and the Skypilot cluster is terminated or the pod is deleted, the logs won’t be available.
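
To rule out the first cause, you can compare the home directory in effect now with the one you used at launch. The sketch below only mirrors the behavior described above (NEMORUN_HOME if set, otherwise ~/.nemorun) and does not call Nemo-Run itself; resolve_nemorun_home and home_matches are hypothetical helper names.

```python
import os
from pathlib import Path


def resolve_nemorun_home() -> Path:
    """Return NEMORUN_HOME if it is set, else the default described above."""
    return Path(os.environ.get("NEMORUN_HOME", "~/.nemorun")).expanduser()


def home_matches(recorded_home: str) -> bool:
    """Compare a home path recorded at launch time with the current one."""
    return Path(recorded_home).expanduser() == resolve_nemorun_home()
```

If you note the value of resolve_nemorun_home() when launching an experiment, calling home_matches with that value later tells you whether you are looking in the same place.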