跳转至

RunnerConfig

danling.runners.RunnerConfig

Bases: Config

Configuration class for managing and persisting all states of a DanLing Runner.

The RunnerConfig class provides a hierarchical configuration system that handles:

  1. Parameter management: Hyperparameters, model settings, training options
  2. Experiment tracking: IDs, names, and other metadata for runs and experiments
  3. Serialization: Save/load configurations from files or command line
  4. Reproducibility: Tracking seeds and settings for reproducible runs

RunnerConfig inherits from Config and provides attribute-style access to nested values:

Python
1
2
3
4
5
6
7
8
9
config = RunnerConfig()

# Attribute-style access (recommended)
config.optim.lr = 1e-3
config.network.type = "resnet50"

# Dictionary-style access (alternative)
config["optim"]["lr"] = 1e-3
config["network"]["type"] = "resnet50"

RunnerConfig objects support three types of hierarchical attribute access patterns:

  1. Direct assignment for simple values:

    Python
    config.epochs = 10
    

  2. Auto-created nested objects for hierarchical settings:

    Python
    1
    2
    3
    # Auto-creates the nested objects
    config.optim.lr = 0.01
    config.optim.weight_decay = 1e-4
    

  3. Class-level annotations for typed properties with defaults:

    Python
    1
    2
    3
    class MyConfig(RunnerConfig):
        epochs: int = 10
        learning_rate: float = 0.001
    

Command-line integration is built-in. You can define a configuration and then override values via command line arguments:

Python
config = MyConfig()
config.parse()  # Parse CLI args, e.g., --epochs 20 --optim.lr 0.01

General:

Name Type Description
stack str

Runner stack selector used by danling.runners.Runner. Supported values: "auto", "ddp"/"torch", "graph", "deepspeed"/"ds", "parallel". Defaults to "auto" (resolved to "ddp" at runtime).

Reproducibility:

Name Type Description
seed int

Random seed for reproducibility. If not set, a random value is generated.

deterministic bool

Whether to enforce deterministic operations in PyTorch. Defaults to False for better performance. Set to True for exact reproducibility.

Progress:

Name Type Description
steps int | None

Final global step target for training. In step mode, training stops when global_step >= steps.

epochs int | None

Final epoch index boundary for training. In epoch mode, training iterates epochs until epoch == epochs.

accum_steps int

Number of micro-batches per optimizer step. Defaults to 1.

Model Evaluation:

Name Type Description
score.split str

Dataset split to use for model selection. Defaults to None. If unset, runner infers once (val -> validate -> first available) and reuses it unless that split disappears from results.

score.metric str

Metric key to use for model selection. Defaults to “loss”.

score.patience int | float

Early-stop patience in epoch mode. Defaults to infinity.

sched.interval str

Scheduler advancement policy. Supported values: "step" and "epoch"/"validation". Non-metric schedulers default to "step". Metric schedulers such as ReduceLROnPlateau default to "epoch" and advance after the aggregated round result is available.

sched.monitor str

Optional metric selector for metric schedulers. Supports dotted paths such as "val.loss". When unset, the runner prefers score.split/score.metric when available and otherwise resolves score.metric from the aggregated result.

Optimization:

Name Type Description
optim.type str | None

Optimizer registry key, for example "adamw" or "sgd". When unset, the runner does not auto-build an optimizer.

optim.lr / weight_decay / betas / eps / momentum

Common optimizer kwargs forwarded to the optimizer registry when present.

optim.param_groups list[dict] | None

Optional regex-based optimizer parameter groups. Each entry requires pattern, matched against TorchRunner.iter_optimizer_named_parameters() with re.search semantics, and may provide optimizer group options directly. Anchor patterns with ^/$ when a full FQN position matters. lr_multiplier, weight_decay_multiplier, beta1, and beta2 derive group values from top-level optim.lr, optim.weight_decay, and optim.betas. Unmatched parameters keep the optimizer-level defaults.

sched.type str | None

Scheduler registry key, for example "cosine", "linear", "step", or "reduce_on_plateau". When unset, the runner does not auto-build a scheduler.

sched.total_steps / warmup_steps / cooldown_steps / final_lr_ratio / final_lr

Common DanLing LRScheduler kwargs forwarded when present.

sched.step_size / milestones / gamma / T_max / eta_min / patience / factor

Common PyTorch scheduler kwargs forwarded when present.

I/O:

Name Type Description
workspace.root str

Root directory for experiments. Defaults to "experiments".

checkpoint str | None

Optional full-state checkpoint source for resume workflows. This is a path-like identifier consumed by runner load_checkpoint(...).

resume bool

Auto-resume from the backend-native latest checkpoint source when True.

pretrained str | None

Optional model-only checkpoint source for finetune workflows. This is a path-like identifier consumed by runner load_pretrained(...). Source priority is checkpoint > resume > pretrained.

workspace.lineage str

Top-level lineage namespace. Defaults to "lin" when unset. RunnerWorkspace.dir appends code identity (-<git_hash>) when available.

workspace.experiment str

Experiment namespace. Defaults to "exp".

ckpt.dir str

Checkpoint directory. Relative paths are resolved under workspace.dir. Defaults to "checkpoints".

ckpt.async_mode str

Checkpoint async behavior. Defaults to "async". Supported values: "disabled", "async", "async_with_pinned_mem".

ckpt.dedicated_async_process_group bool

Use a dedicated process group for async DCP checkpoint I/O to reduce interference with training collectives. Defaults to True.

ckpt.async_process_group_backend str

Backend for the dedicated async checkpoint process group. Defaults to "gloo".

ckpt.backend str

Checkpoint backend selected at runtime by the runner ("dcp" for distributed runs, "file" otherwise when set to "auto").

ckpt.wait_timeout_seconds float

Timeout in seconds when draining async checkpoint writes during runner shutdown (None waits indefinitely).

parallel.axes.replicate int

Data-replication degree for DDP/HSDP-style replication. Defaults to 1.

parallel.axes.shard int

Data-sharding degree for FSDP-style sharding. Defaults to 1. Set one parallel axis, commonly shard, to -1 to auto-fill it from WORLD_SIZE and the other configured axes.

parallel.axes.context int

Context/sequence parallel degree. Defaults to 1.

parallel.axes.pipeline int

Pipeline-parallel degree. Defaults to 1.

parallel.axes.tensor int

Tensor-parallel degree. Defaults to 1.

parallel.axes.expert int

Expert-parallel degree for MoE models. Defaults to 1.

parallel.axes.expert_tensor int

Expert tensor-parallel degree for MoE models. Defaults to 1.

parallel.pipeline_schedule str

Pipeline schedule class name resolved by torch.distributed.pipelining.schedules.get_schedule_class. Defaults to "1F1B".

parallel.pipeline_microbatch_size int

Local microbatch size used to infer schedule microbatch count as dataloader.batch_size // pipeline_microbatch_size. Defaults to 1.

parallel.pipeline_microbatches int

Explicit schedule microbatch count. When set, overrides pipeline_microbatch_size-based inference.

parallel.pipeline_partitions list[list[str]] | None

Optional module FQNs for simple pipeline stage extraction. The outer list length is the total pipeline stage count and must be divisible by parallel.axes.pipeline; complex partitioning should use model.build_pipeline_model_part(...) or override ParallelRunner.build_pipeline_model_part / ParallelRunner.build_pipeline_model_parts.

logging.enabled bool

Whether to enable file logging. Defaults to True. Logging is initialized on the main process only.

logging.interval int

Iterations between log outputs. If None, auto-calculated.

logging.file str | None

Optional log file path. Defaults to workspace.dir/logs/{timestamp}.log.

tensorboard.enabled bool

Whether to use TensorBoard for visualization. Defaults to False.

tensorboard.log_dir str | None

Optional TensorBoard log directory. Defaults to workspace.dir/tensorboard/{timestamp}.

tensorboard.comment / purge_step / max_queue / flush_secs / filename_suffix

Optional torch.utils.tensorboard.SummaryWriter kwargs.

wandb.enabled bool

Whether to enable Weights & Biases scalar logging. Defaults to False.

wandb.project str | None

Optional W&B project name. Defaults to lineage.

wandb.entity str | None

Optional W&B entity/team override.

wandb.id str | None

Optional stable W&B run id.

wandb.group str | None

Optional W&B group name. Defaults to experiment.

wandb.name str | None

Optional W&B display name. Defaults to stable runner id.

wandb.notes str | None

Optional W&B run notes.

wandb.job_type str | None

Optional W&B job type.

wandb.tags list[str] | str | None

Optional W&B run tags.

wandb.dir str | None

Optional local W&B run directory. Defaults to run directory.

wandb.mode str | None

Optional W&B mode such as "online" or "offline".

wandb.resume / save_code / sync_tensorboard

Optional common W&B init kwargs.

ft.enabled bool

Enable TorchFT-managed fault tolerance. Defaults to False.

ft.process_group str

TorchFT coordination backend. Supported values: "gloo" and "nccl". Defaults to "gloo".

ft.process_group_timeout_seconds float

TorchFT process-group timeout in seconds. Defaults to 10.0.

ft.replica_id int

Replica-group identifier for this run. Defaults to 0.

ft.group_size int

Number of replica groups participating in TorchFT. Defaults to 1.

ft.min_replica_size int

Minimum healthy replicas required by TorchFT per step. Defaults to 1.

ckpt.interval int

Interval between checkpoint save attempts for latest/best. The same cadence is used for history checkpoints. Uses epochs in epoch mode and global steps in step mode. If unset, runner defaults are used by mode.

ckpt.keep_latest_k int

Number of framework-generated history checkpoints to retain. 0 disables retention pruning.

ckpt.enabled bool

Whether to persist checkpoints. Set False to allow loading while disabling writes.

ckpt.dataloader_checkpoint.enabled bool

Enable per-replica dataloader checkpoints. Uses DCP and stores checkpoints under ckpt.dataloader_checkpoint.prefix-{ckpt.dataloader_checkpoint.replica_id}.

ckpt.dataloader_checkpoint.replica_id str | None

Replica identifier used for dataloader checkpoint directory. Defaults to FT_REPLICA_ID environment variable, then process rank.

ckpt.dataloader_checkpoint.prefix str

Prefix used for per-replica dataloader checkpoint directories. Defaults to "dataloader-replica".

ckpt.export_dtype str

Optional dtype cast for model-only checkpoint export (fp32/fp16/bf16/fp64 aliases supported).

dataloader.batch_size int | None

Local dataloader batch size passed to StatefulDataLoader.

dataloader.shuffle bool | None

Optional shuffle override. When unset, train splits shuffle and non-train splits do not.

dataloader.sampler / batch_sampler / collate_fn

Optional DataLoader construction hooks forwarded to StatefulDataLoader.

dataloader.drop_last bool | None

Optional drop-last override. When unset, train splits drop incomplete batches and non-train splits keep them.

dataloader.num_workers / persistent_workers / prefetch_factor / pin_memory

Standard PyTorch DataLoader kwargs forwarded to StatefulDataLoader.

dataloader.in_order bool

PyTorch DataLoader ordering flag.

dataloader.snapshot_every_n_steps int | None

StatefulDataLoader snapshot cadence.

dataloader.<split> dict

Split-specific overrides merged on top of default dataloader kwargs, for example dataloader.train.shuffle=False.

fsdp.enabled bool

Enable FSDP2 wrapping in ParallelRunner. The FSDP mesh is derived from parallel.axes.replicate, parallel.axes.shard, and later parallel.axes.context.

fsdp.reshard_after_forward bool | int | None

Optional FSDP2 reshard policy.

fsdp.shard_placement_fn bool | int | None

Optional FSDP2 shard placement callable.

fsdp.mixed_precision_policy bool | int | None

Optional FSDP2 mixed precision policy.

fsdp.offload_policy bool | int | None

Optional FSDP2 CPU offload policy.

fsdp.ignored_params bool | int | None

Optional parameters excluded from FSDP2 wrapping.

compile.enabled bool

Whether to enable torch.compile for runner-selected model compilation points.

compile.backend str

Optional backend passed to torch.compile.

compile.fullgraph bool

Optional fullgraph flag for torch.compile.

compile.dynamic bool

Optional dynamic flag for torch.compile.

compile.mode str

Optional mode passed to torch.compile.

compile.options dict

Optional options passed to torch.compile.

compile.optimize_ddp str | None

Optional torch._dynamo.config.optimize_ddp value. Defaults to "ddp_optimizer" when model compile is enabled.

compile.precompile_artifact_dir str | None

Optional directory for GraphRunner torch compiler cache artifacts. Current eager runners ignore this setting.

compile.memory_policy str | None

Optional graph-memory policy label for experimental graph paths. GraphRunner currently accepts None/"default"; activation remat/offload policies require a dedicated graph pass pipeline.

dist.init_timeout_seconds int | None

Optional distributed process-group timeout used during initialization and early startup.

dist.train_timeout_seconds int | None

Optional tighter distributed process-group timeout applied once after the first successful optimizer step.

gc.interval int | None

Optional periodic Python GC cadence. When unset, runner-managed GC pacing is disabled.

gc.generation int

Python GC generation passed to gc.collect(...) when pacing is enabled. Defaults to 1.

gc.disable_automatic bool

Disable CPython automatic GC while runner-managed pacing is enabled. Defaults to True.

profiling.enabled bool

Enable bounded-step torch.profiler tracing. Defaults to False.

profiling.activities str | list[str] | None

Explicit profiler activities such as "cpu" or ["cpu", "cuda"]. When unset, CPU is used and CUDA is added for CUDA runners.

profiling.wait int

Profiler schedule wait steps before warmup. Defaults to 1.

profiling.warmup int

Profiler schedule warmup steps. Defaults to 1.

profiling.active int

Profiler schedule active trace steps. Defaults to 3.

profiling.repeat int | None

Optional profiler schedule repeat count.

profiling.record_shapes bool

Enable shape recording in traces. Defaults to False.

profiling.profile_memory bool

Enable profiler-side memory recording. Defaults to False.

profiling.with_stack bool

Include Python stack traces in profiler output. Defaults to False.

profiling.with_flops bool

Enable profiler FLOPs estimation when available. Defaults to False.

profiling.with_modules / acc_events / use_cuda

Optional profiler kwargs.

profiling.post_processing_timeout_seconds float | None

Optional profiler post-processing timeout in seconds.

profiling.trace_dir str

Relative or absolute trace output directory. Defaults to "profiles".

heartbeat.enabled bool

Enable a machine-readable per-rank heartbeat/progress file. Defaults to False.

heartbeat.interval_seconds float

Heartbeat write interval in seconds. Defaults to 60.0.

heartbeat.dir str

Heartbeat directory. Relative paths are resolved under workspace.dir. Defaults to "heartbeats".

Text Only
 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
# Use in a runner
runner = Runner(config)
```

Custom config class with typed attributes:
```python
class TrainingConfig(RunnerConfig):
    # Type annotations provide auto-completion and validation
    epochs: int = 100
    batch_size: int = 32
    precision: str = "fp16"

    def __init__(self):
        super().__init__()
        # Initialize nested settings
        self.optim.type = "adamw"
        self.optim.lr = 1e-3

    def post(self):
        # Called after parsing CLI args
        super().post()
        # Create derived settings
        self.workspace.experiment = f"{self.network.type}_{self.optim.lr}"
```

Command-line integration:
```bash
# Override config settings via CLI
python train.py --epochs 50 --optim.lr 0.0005 --network.type resnet50
```
Note

Always store all parameters needed to reproduce a run in the RunnerConfig. The RunnerConfig is automatically saved with checkpoints, enabling exact resumption.

See Also
Source code in danling/runners/config.py
Python
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
610
611
612
613
614
615
616
617
618
619
620
621
622
623
624
625
626
627
628
629
630
631
632
633
634
635
636
637
638
639
640
641
642
643
644
645
646
647
648
649
650
651
652
653
654
655
656
657
658
659
660
661
662
663
664
665
666
667
668
669
670
671
672
673
674
675
676
677
678
679
680
681
682
683
684
685
686
687
688
689
690
691
692
693
694
695
696
697
698
699
700
701
702
703
704
705
706
707
708
709
710
711
712
713
714
715
716
717
718
719
720
721
722
723
724
725
726
727
728
729
730
731
732
733
734
735
736
737
738
739
740
741
742
743
744
745
746
747
class RunnerConfig(chanfig.Config):  # pylint: disable=too-many-instance-attributes
    r"""
    Configuration class for managing and persisting all states of a DanLing Runner.

    The RunnerConfig class provides a hierarchical configuration system that handles:

    1. **Parameter management**: Hyperparameters, model settings, training options
    2. **Experiment tracking**: IDs, names, and other metadata for runs and experiments
    3. **Serialization**: Save/load configurations from files or command line
    4. **Reproducibility**: Tracking seeds and settings for reproducible runs

    RunnerConfig inherits from [`Config`][chanfig.Config] and provides attribute-style access to nested values:

    ```python
    config = RunnerConfig()

    # Attribute-style access (recommended)
    config.optim.lr = 1e-3
    config.network.type = "resnet50"

    # Dictionary-style access (alternative)
    config["optim"]["lr"] = 1e-3
    config["network"]["type"] = "resnet50"
    ```

    RunnerConfig objects support three types of hierarchical attribute access patterns:

    1. **Direct assignment** for simple values:
       ```python
       config.epochs = 10
       ```

    2. **Auto-created nested objects** for hierarchical settings:
       ```python
       # Auto-creates the nested objects
       config.optim.lr = 0.01
       config.optim.weight_decay = 1e-4
       ```

    3. **Class-level annotations** for typed properties with defaults:
       ```python
       class MyConfig(RunnerConfig):
           epochs: int = 10
           learning_rate: float = 0.001
       ```

    Command-line integration is built-in. You can define a configuration and
    then override values via command line arguments:

    ```python
    config = MyConfig()
    config.parse()  # Parse CLI args, e.g., --epochs 20 --optim.lr 0.01
    ```

    Attributes: General:
        stack (str): Runner stack selector used by `danling.runners.Runner`.
            Supported values: `"auto"`, `"ddp"`/`"torch"`, `"graph"`,
            `"deepspeed"`/`"ds"`, `"parallel"`.
            Defaults to `"auto"` (resolved to `"ddp"` at runtime).

    Attributes: Reproducibility:
        seed (int): Random seed for reproducibility. If not set, a random value is generated.
        deterministic (bool): Whether to enforce deterministic operations in PyTorch.
            Defaults to `False` for better performance. Set to `True` for exact reproducibility.

    Attributes: Progress:
        steps (int | None): Final global step target for training.
            In step mode, training stops when `global_step >= steps`.
        epochs (int | None): Final epoch index boundary for training.
            In epoch mode, training iterates epochs until `epoch == epochs`.
        accum_steps (int): Number of micro-batches per optimizer step.
            Defaults to `1`.

    Attributes: Model Evaluation:
        score.split (str): Dataset split to use for model selection. Defaults to None.
            If unset, runner infers once (`val` -> `validate` -> first available) and reuses it
            unless that split disappears from results.
        score.metric (str): Metric key to use for model selection. Defaults to "loss".
        score.patience (int | float): Early-stop patience in epoch mode.
            Defaults to infinity.
        sched.interval (str): Scheduler advancement policy.
            Supported values: `"step"` and `"epoch"`/`"validation"`.
            Non-metric schedulers default to `"step"`. Metric schedulers such as
            `ReduceLROnPlateau` default to `"epoch"` and advance after the aggregated
            round result is available.
        sched.monitor (str): Optional metric selector for metric schedulers.
            Supports dotted paths such as `"val.loss"`.
            When unset, the runner prefers `score.split`/`score.metric` when available and
            otherwise resolves `score.metric` from the aggregated result.

    Attributes: Optimization:
        optim.type (str | None): Optimizer registry key, for example `"adamw"` or `"sgd"`.
            When unset, the runner does not auto-build an optimizer.
        optim.lr / weight_decay / betas / eps / momentum: Common optimizer
            kwargs forwarded to the optimizer registry when present.
        optim.param_groups (list[dict] | None): Optional regex-based optimizer
            parameter groups. Each entry requires `pattern`, matched against
            `TorchRunner.iter_optimizer_named_parameters()` with `re.search`
            semantics, and may provide optimizer group options directly. Anchor
            patterns with `^`/`$` when a full FQN position matters.
            `lr_multiplier`,
            `weight_decay_multiplier`, `beta1`, and `beta2` derive group values
            from top-level `optim.lr`, `optim.weight_decay`, and `optim.betas`.
            Unmatched parameters keep the optimizer-level defaults.
        sched.type (str | None): Scheduler registry key, for example `"cosine"`,
            `"linear"`, `"step"`, or `"reduce_on_plateau"`. When unset, the runner
            does not auto-build a scheduler.
        sched.total_steps / warmup_steps / cooldown_steps / final_lr_ratio / final_lr:
            Common DanLing `LRScheduler` kwargs forwarded when present.
        sched.step_size / milestones / gamma / T_max / eta_min / patience / factor:
            Common PyTorch scheduler kwargs forwarded when present.

    Attributes: I/O:
        workspace.root (str): Root directory for experiments. Defaults to `"experiments"`.
        checkpoint (str | None): Optional full-state checkpoint source for resume workflows.
            This is a path-like identifier consumed by runner `load_checkpoint(...)`.
        resume (bool): Auto-resume from the backend-native latest checkpoint source when `True`.
        pretrained (str | None): Optional model-only checkpoint source for finetune workflows.
            This is a path-like identifier consumed by runner `load_pretrained(...)`.
            Source priority is `checkpoint` > `resume` > `pretrained`.
        workspace.lineage (str): Top-level lineage namespace.
            Defaults to `"lin"` when unset.
            `RunnerWorkspace.dir` appends code identity (`-<git_hash>`) when available.
        workspace.experiment (str): Experiment namespace. Defaults to `"exp"`.
        ckpt.dir (str): Checkpoint directory. Relative paths are resolved under `workspace.dir`.
            Defaults to `"checkpoints"`.
        ckpt.async_mode (str): Checkpoint async behavior. Defaults to `"async"`.
            Supported values: `"disabled"`, `"async"`, `"async_with_pinned_mem"`.
        ckpt.dedicated_async_process_group (bool): Use a dedicated process group for async DCP
            checkpoint I/O to reduce interference with training collectives. Defaults to `True`.
        ckpt.async_process_group_backend (str): Backend for the dedicated async checkpoint process
            group. Defaults to `"gloo"`.
        ckpt.backend (str): Checkpoint backend selected at runtime by the runner
            (`"dcp"` for distributed runs, `"file"` otherwise when set to `"auto"`).
        ckpt.wait_timeout_seconds (float): Timeout in seconds when draining async checkpoint writes
            during runner shutdown (`None` waits indefinitely).
        parallel.axes.replicate (int): Data-replication degree for DDP/HSDP-style replication.
            Defaults to `1`.
        parallel.axes.shard (int): Data-sharding degree for FSDP-style sharding.
            Defaults to `1`. Set one parallel axis, commonly `shard`, to `-1`
            to auto-fill it from `WORLD_SIZE` and the other configured axes.
        parallel.axes.context (int): Context/sequence parallel degree. Defaults to `1`.
        parallel.axes.pipeline (int): Pipeline-parallel degree. Defaults to `1`.
        parallel.axes.tensor (int): Tensor-parallel degree. Defaults to `1`.
        parallel.axes.expert (int): Expert-parallel degree for MoE models. Defaults to `1`.
        parallel.axes.expert_tensor (int): Expert tensor-parallel degree for MoE models. Defaults to `1`.
        parallel.pipeline_schedule (str): Pipeline schedule class name resolved by
            `torch.distributed.pipelining.schedules.get_schedule_class`.
            Defaults to `"1F1B"`.
        parallel.pipeline_microbatch_size (int): Local microbatch size used to infer
            schedule microbatch count as `dataloader.batch_size // pipeline_microbatch_size`.
            Defaults to `1`.
        parallel.pipeline_microbatches (int): Explicit schedule microbatch count.
            When set, overrides `pipeline_microbatch_size`-based inference.
        parallel.pipeline_partitions (list[list[str]] | None): Optional
            module FQNs for simple pipeline stage extraction. The outer list
            length is the total pipeline stage count and must be divisible by
            `parallel.axes.pipeline`; complex partitioning should use
            `model.build_pipeline_model_part(...)` or override
            `ParallelRunner.build_pipeline_model_part` /
            `ParallelRunner.build_pipeline_model_parts`.
        logging.enabled (bool): Whether to enable file logging. Defaults to `True`.
            Logging is initialized on the main process only.
        logging.interval (int): Iterations between log outputs. If None, auto-calculated.
        logging.file (str | None): Optional log file path.
            Defaults to `workspace.dir/logs/{timestamp}.log`.
        tensorboard.enabled (bool): Whether to use TensorBoard for visualization. Defaults to `False`.
        tensorboard.log_dir (str | None): Optional TensorBoard log directory.
            Defaults to `workspace.dir/tensorboard/{timestamp}`.
        tensorboard.comment / purge_step / max_queue / flush_secs / filename_suffix:
            Optional `torch.utils.tensorboard.SummaryWriter` kwargs.
        wandb.enabled (bool): Whether to enable Weights & Biases scalar logging. Defaults to `False`.
        wandb.project (str | None): Optional W&B project name. Defaults to `lineage`.
        wandb.entity (str | None): Optional W&B entity/team override.
        wandb.id (str | None): Optional stable W&B run id.
        wandb.group (str | None): Optional W&B group name. Defaults to `experiment`.
        wandb.name (str | None): Optional W&B display name. Defaults to stable runner `id`.
        wandb.notes (str | None): Optional W&B run notes.
        wandb.job_type (str | None): Optional W&B job type.
        wandb.tags (list[str] | str | None): Optional W&B run tags.
        wandb.dir (str | None): Optional local W&B run directory. Defaults to run directory.
        wandb.mode (str | None): Optional W&B mode such as `"online"` or `"offline"`.
        wandb.resume / save_code / sync_tensorboard: Optional common W&B init kwargs.
        ft.enabled (bool): Enable TorchFT-managed fault tolerance. Defaults to `False`.
        ft.process_group (str): TorchFT coordination backend. Supported values: `"gloo"` and `"nccl"`.
            Defaults to `"gloo"`.
        ft.process_group_timeout_seconds (float): TorchFT process-group timeout in seconds.
            Defaults to `10.0`.
        ft.replica_id (int): Replica-group identifier for this run. Defaults to `0`.
        ft.group_size (int): Number of replica groups participating in TorchFT. Defaults to `1`.
        ft.min_replica_size (int): Minimum healthy replicas required by TorchFT per step.
            Defaults to `1`.
        ckpt.interval (int): Interval between checkpoint save attempts for `latest`/`best`.
            The same cadence is used for history checkpoints.
            Uses epochs in epoch mode and global steps in step mode.
            If unset, runner defaults are used by mode.
        ckpt.keep_latest_k (int): Number of framework-generated history checkpoints to retain.
            `0` disables retention pruning.
        ckpt.enabled (bool): Whether to persist checkpoints. Set `False` to allow loading while disabling writes.
        ckpt.dataloader_checkpoint.enabled (bool): Enable per-replica dataloader checkpoints.
            Uses DCP and stores checkpoints under
            `ckpt.dataloader_checkpoint.prefix-{ckpt.dataloader_checkpoint.replica_id}`.
        ckpt.dataloader_checkpoint.replica_id (str | None): Replica identifier used for dataloader checkpoint directory.
            Defaults to `FT_REPLICA_ID` environment variable, then process rank.
        ckpt.dataloader_checkpoint.prefix (str): Prefix used for per-replica dataloader checkpoint directories.
            Defaults to `"dataloader-replica"`.
        ckpt.export_dtype (str): Optional dtype cast for model-only checkpoint export
            (`fp32`/`fp16`/`bf16`/`fp64` aliases supported).
        dataloader.batch_size (int | None): Local dataloader batch size passed to
            `StatefulDataLoader`.
        dataloader.shuffle (bool | None): Optional shuffle override. When unset, train
            splits shuffle and non-train splits do not.
        dataloader.sampler / batch_sampler / collate_fn: Optional DataLoader
            construction hooks forwarded to `StatefulDataLoader`.
        dataloader.drop_last (bool | None): Optional drop-last override. When unset,
            train splits drop incomplete batches and non-train splits keep them.
        dataloader.num_workers / persistent_workers / prefetch_factor / pin_memory:
            Standard PyTorch DataLoader kwargs forwarded to `StatefulDataLoader`.
        dataloader.in_order (bool): PyTorch DataLoader ordering flag.
        dataloader.snapshot_every_n_steps (int | None): StatefulDataLoader snapshot cadence.
        dataloader.<split> (dict): Split-specific overrides merged on top of default
            dataloader kwargs, for example `dataloader.train.shuffle=False`.
        fsdp.enabled (bool): Enable FSDP2 wrapping in `ParallelRunner`.
            The FSDP mesh is derived from `parallel.axes.replicate`,
            `parallel.axes.shard`, and later `parallel.axes.context`.
        fsdp.reshard_after_forward (bool | int | None): Optional FSDP2 reshard policy.
        fsdp.shard_placement_fn: Optional FSDP2 shard placement callable.
        fsdp.mixed_precision_policy: Optional FSDP2 mixed precision policy.
        fsdp.offload_policy: Optional FSDP2 CPU offload policy.
        fsdp.ignored_params: Optional parameters excluded from FSDP2 wrapping.
        compile.enabled (bool): Whether to enable `torch.compile` for runner-selected model compilation points.
        compile.backend (str): Optional backend passed to `torch.compile`.
        compile.fullgraph (bool): Optional `fullgraph` flag for `torch.compile`.
        compile.dynamic (bool): Optional `dynamic` flag for `torch.compile`.
        compile.mode (str): Optional mode passed to `torch.compile`.
        compile.options (dict): Optional options passed to `torch.compile`.
        compile.optimize_ddp (str | None): Optional `torch._dynamo.config.optimize_ddp` value.
            Defaults to `"ddp_optimizer"` when model compile is enabled.
        compile.precompile_artifact_dir (str | None): Optional directory for GraphRunner torch compiler
            cache artifacts. Current eager runners ignore this setting.
        compile.memory_policy (str | None): Optional graph-memory policy label for experimental graph paths.
            GraphRunner currently accepts `None`/`"default"`; activation remat/offload policies require a
            dedicated graph pass pipeline.
        dist.init_timeout_seconds (int | None): Optional distributed process-group timeout used during
            initialization and early startup.
        dist.train_timeout_seconds (int | None): Optional tighter distributed process-group timeout applied
            once after the first successful optimizer step.
        gc.interval (int | None): Optional periodic Python GC cadence.
            When unset, runner-managed GC pacing is disabled.
        gc.generation (int): Python GC generation passed to `gc.collect(...)` when pacing is enabled.
            Defaults to `1`.
        gc.disable_automatic (bool): Disable CPython automatic GC while runner-managed pacing is enabled.
            Defaults to `True`.
        profiling.enabled (bool): Enable bounded-step `torch.profiler` tracing. Defaults to `False`.
        profiling.activities (str | list[str] | None): Explicit profiler activities such as
            `"cpu"` or `["cpu", "cuda"]`. When unset, CPU is used and CUDA is added
            for CUDA runners.
        profiling.wait (int): Profiler schedule wait steps before warmup. Defaults to `1`.
        profiling.warmup (int): Profiler schedule warmup steps. Defaults to `1`.
        profiling.active (int): Profiler schedule active trace steps. Defaults to `3`.
        profiling.repeat (int | None): Optional profiler schedule repeat count.
        profiling.record_shapes (bool): Enable shape recording in traces. Defaults to `False`.
        profiling.profile_memory (bool): Enable profiler-side memory recording. Defaults to `False`.
        profiling.with_stack (bool): Include Python stack traces in profiler output. Defaults to `False`.
        profiling.with_flops (bool): Enable profiler FLOPs estimation when available. Defaults to `False`.
        profiling.with_modules / acc_events / use_cuda: Optional profiler kwargs.
        profiling.post_processing_timeout_seconds (float | None): Optional profiler
            post-processing timeout in seconds.
        profiling.trace_dir (str): Relative or absolute trace output directory. Defaults to `"profiles"`.
        heartbeat.enabled (bool): Enable a machine-readable per-rank heartbeat/progress file. Defaults to `False`.
        heartbeat.interval_seconds (float): Heartbeat write interval in seconds. Defaults to `60.0`.
        heartbeat.dir (str): Heartbeat directory. Relative paths are resolved under `workspace.dir`.
            Defaults to `"heartbeats"`.
    Examples:
        Basic usage:
        ```python
        # Create a config
        config = RunnerConfig()
        config.network.type = "resnet18"
        config.optim.lr = 0.001
        config.epochs = 10

        # Use in a runner
        runner = Runner(config)
        ```

        Custom config class with typed attributes:
        ```python
        class TrainingConfig(RunnerConfig):
            # Type annotations provide auto-completion and validation
            epochs: int = 100
            batch_size: int = 32
            precision: str = "fp16"

            def __init__(self):
                super().__init__()
                # Initialize nested settings
                self.optim.type = "adamw"
                self.optim.lr = 1e-3

            def post(self):
                # Called after parsing CLI args
                super().post()
                # Create derived settings
                self.workspace.experiment = f"{self.network.type}_{self.optim.lr}"
        ```

        Command-line integration:
        ```bash
        # Override config settings via CLI
        python train.py --epochs 50 --optim.lr 0.0005 --network.type resnet50
        ```

    Note:
        Always store all parameters needed to reproduce a run in the RunnerConfig.
        The RunnerConfig is automatically saved with checkpoints, enabling exact resumption.

    See Also:
        - [`Runner`][danling.runners.Runner]: Main runner class that uses this config.
        - [`chanfig.Config`](https://github.com/ultmaster/chanfig): Base config implementation.
    """

    # DO NOT set default value in class, as they won't be stored in `__dict__`.

    stack: str = "auto"
    name: Optional[str] = None

    seed: Optional[int] = None
    deterministic: bool = False

    steps: Optional[int] = None
    epochs: Optional[int] = None
    accum_steps: int = 1
    train_splits: Union[Sequence[str], str, None] = None
    evaluate_splits: Union[Sequence[str], str, None] = None
    precision: Optional[str] = None
    max_grad_value: Optional[float] = None
    max_grad_norm: Optional[float] = None
    skip_nonfinite_grad: bool = False

    checkpoint: Optional[str] = None
    resume: bool = False
    pretrained: Optional[str] = None

    optim: OptimizerConfig
    sched: SchedulerConfig
    fp8: Fp8Config
    deepspeed: Optional[Mapping[str, Any]] = None

    score: ScoreConfig
    workspace: WorkspaceConfig
    logging: LoggingConfig
    tensorboard: TensorboardConfig
    wandb: WandbConfig
    ft: FaultToleranceConfig

    compile: CompileConfig
    dist: DistributedConfig
    gc: GcConfig
    profiling: ProfilingConfig
    heartbeat: HeartbeatConfig
    ckpt: CheckpointConfig
    dataloader: DataloaderConfig
    fsdp: FsdpConfig
    parallel: ParallelConfig

    def __post_init__(self, *args, **kwargs) -> None:
        super().__post_init__(*args, **kwargs)
        if not isinstance(self.tensorboard, TensorboardConfig):
            self.tensorboard = TensorboardConfig()
        self.validate()

    def post(self) -> None:
        super().post()
        self.validate()

    def validate(self) -> None:
        if self.steps is not None and self.epochs is not None:
            raise ValueError("`steps` and `epochs` are mutually exclusive; set only one training boundary")

    @staticmethod
    def _semantic_section(section: Mapping[str, Any]) -> chanfig.NestedDict:
        return chanfig.NestedDict({key: value for key, value in section.items() if value is not None})

    def canonical(self) -> chanfig.NestedDict:
        canonical = chanfig.NestedDict(self.dict())
        stack = normalize_stack_name(canonical.get("stack", "auto"))
        canonical["stack"] = stack
        for key in NON_SEMANTIC_CONFIG_KEYS:
            canonical.pop(key, None)

        ckpt = canonical.get("ckpt")
        if isinstance(ckpt, Mapping):
            semantic_ckpt = chanfig.NestedDict(ckpt)
            backend = semantic_ckpt.get("backend")
            if backend is not None:
                backend = str(backend).strip().lower()
                if backend == "auto":
                    semantic_ckpt.pop("backend", None)
                else:
                    semantic_ckpt["backend"] = backend
            for key in NON_SEMANTIC_CKPT_KEYS:
                semantic_ckpt.pop(key, None)
            if semantic_ckpt:
                canonical["ckpt"] = semantic_ckpt
            else:
                canonical.pop("ckpt", None)

        for key in ("optim", "sched"):
            section = canonical.get(key)
            if isinstance(section, Mapping):
                semantic_section = self._semantic_section(section)
                if semantic_section:
                    canonical[key] = semantic_section
                else:
                    canonical.pop(key, None)

        if stack != "parallel":
            canonical.pop("fsdp", None)
            canonical.pop("parallel", None)
        return canonical

    def __hash__(self) -> int:
        digest = hashlib.sha1(self.canonical().yamls().encode("utf-8")).digest()
        return int.from_bytes(digest[:8], byteorder="big", signed=False)