# DanLing Runner
The Runner module provides a unified interface for the complete deep learning model lifecycle, supporting training, evaluation, inference, and experiment management across multiple distributed computing platforms.
## Core Concepts

DanLing uses a two-level architecture with `Runner` + `Config` to provide a flexible, extensible framework:
### Config

`Config` is a specialized dictionary that stores all serializable state, including:
- Hyperparameters (learning rate, batch size, etc.)
- Model configuration
- Dataset settings
`Config` extends `chanfig.Config`; it is hierarchical, and its entries can be accessed as attributes:
```python
import danling as dl

config = dl.Config()
config.optim.lr = 1e-3
config.dataloader.batch_size = 32
config.precision = "fp16"
```
Configurations can be serialized to YAML/JSON and loaded from the command line:
```python
config = dl.Config()
config.parse()              # Loads from command line arguments
config.dump("config.yaml")  # Dump to a YAML file
```
By default, `Config` loads the config file specified via `--config config.yaml`.
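For instance, a config file can be combined with command-line overrides. A minimal sketch, assuming a `config.yaml` exists and that `parse` accepts an explicit argument list (as `argparse`-style parsers do); the file name and override values are made up:

```python
import danling as dl

config = dl.Config()
# Equivalent to running: python train.py --config config.yaml --optim.lr 3e-4
config.parse(["--config", "config.yaml", "--optim.lr", "3e-4"])
print(config.optim.lr)  # command-line value overrides the file (assumed precedence)
```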
### Runner

`Runner` is the central class that manages the entire model lifecycle. Key components include:
- State Management: Training/evaluation state and progress tracking
- Model Handling: Loading/saving model checkpoints
- Operations: Training, evaluation, and inference loops
- Metrics: Tracking and logging performance metrics
- Distributed Execution: Managing multi-device/multi-node execution
DanLing supports multiple distributed computing platforms:
- TorchRunner: Native PyTorch DistributedDataParallel (DDP) implementation
- DeepSpeedRunner: Microsoft DeepSpeed integration for large models
- AccelerateRunner: HuggingFace Accelerate for simplified multi-platform execution
The base `Runner` class automatically selects the appropriate platform based on your configuration and available packages.
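Platform selection happens at construction time. A minimal sketch, assuming a `Runner` can be instantiated from a bare `Config`; the keys shown are illustrative:

```python
import danling as dl

config = dl.Config()
config.platform = "torch"  # skip auto-detection and use native PyTorch

runner = dl.Runner(config)
# The base Runner transforms itself into the selected platform
# implementation, so this object behaves like a TorchRunner.
print(type(runner).__name__)
```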
## Customizing Your Workflow

### Custom Configuration
Create a configuration class for better IDE support and documentation:
```python
class MyConfig(dl.Config):
    # Class-level defaults.
    # These values are copied into the config object once it is initialized.
    # Note that you must specify a proper type hint; otherwise they are treated
    # as attributes for maintaining the object.
    # Use `typing.Any` if no type hint is appropriate.
    # To access the attributes that maintain your customized Config class,
    # use config.getattr("attr_name", default_value).
    epoch_end: int = 10
    log_interval: int = 100
    score_metric: str = "accuracy"

    def __init__(self):
        super().__init__()
        self.network.type = "resnet50"
        self.optim.type = "adamw"
        self.optim.lr = 5e-5

    # Called after parsing command line arguments.
    # If no command line support is needed, call `config.boot` for post-processing.
    def post(self):
        super().post()
        # Generate a descriptive experiment name
        self.experiment_name = f"{self.network.type}_{self.optim.lr}"
```
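A hypothetical usage of this class (the script invocation in the comment is made up):

```python
config = MyConfig()
config.parse()              # e.g. `python train.py --optim.lr 1e-4 --epoch_end 20`
runner = dl.Runner(config)  # hand the parsed config to a Runner
```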
### Optimizers and Schedulers
DanLing supports all PyTorch optimizers plus DeepSpeed optimizers when available:
```python
from danling.optim import OPTIMIZERS, SCHEDULERS

# Using string identifiers (recommended for config-driven approach)
config.optim.type = "adamw"  # Options: sgd, adam, adamw, lamb, lion, etc.
config.optim.lr = 1e-3
config.optim.weight_decay = 0.01
optimizer = OPTIMIZERS.build(model.parameters(), **config.optim)

# Using the registry directly
optimizer = OPTIMIZERS.build("sgd", params=model.parameters(), lr=1e-3)

# Scheduler configuration
config.sched.type = "cosine"  # Options: step, cosine, linear, etc.
config.sched.total_steps = 100
scheduler = SCHEDULERS.build(optimizer, **config.sched)
```
Enable mixed precision training for faster execution:
```python
config.precision = "fp16"  # Options: fp16, bf16
config.accum_steps = 4     # Gradient accumulation steps
```
### Custom Metrics
Register custom metrics or use built-in ones:
```python
from torchmetrics import Accuracy

# Built-in metrics
runner.metrics = dl.metric.multiclass_metrics(num_classes=10)

# Custom metrics
runner.metrics = dl.MetricMeters({
    "accuracy": Accuracy(task="multiclass", num_classes=10),
    "my_custom_metric": MyCustomMetric(),
})
```
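`MyCustomMetric` above stands in for any metric object with a torchmetrics-style interface. A minimal sketch of such a class; the metric itself (mean absolute error) is purely illustrative:

```python
import torch
from torchmetrics import Metric


class MyCustomMetric(Metric):
    """Illustrative metric: mean absolute error."""

    def __init__(self):
        super().__init__()
        self.add_state("total", default=torch.tensor(0.0), dist_reduce_fx="sum")
        self.add_state("count", default=torch.tensor(0), dist_reduce_fx="sum")

    def update(self, preds: torch.Tensor, target: torch.Tensor) -> None:
        self.total += (preds - target).abs().sum()
        self.count += target.numel()

    def compute(self) -> torch.Tensor:
        return self.total / self.count
```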
## Extending DanLing

DanLing is designed to be extensible at multiple levels:

### Extension Pattern 1: Customize the Runner
Extend the `Runner` class (not `TorchRunner` directly) to preserve platform selection:
```python
class MyCustomRunner(dl.Runner):
    """Custom runner that works across all platforms."""

    def __init__(self, config):
        super().__init__(config)
        # Your initialization code

    def custom_method(self):
        """Add new functionality."""
        return "Custom result"

    # Override lifecycle methods when needed
    def build_dataloaders(self):
        """Customize dataloader creation."""
        super().build_dataloaders()
        # Your modifications
```
### Extension Pattern 2: Custom Distributed Framework

Extend `TorchRunner` only when implementing a new distributed training framework:
```python
class MyDistributedRunner(dl.TorchRunner):
    """Implement a custom distributed training framework."""

    def init_distributed(self):
        """Initialize custom distributed environment."""
        # Your distributed initialization code

    def backward(self, loss):
        """Custom backward pass."""
        # Your custom backward implementation
```
### Lifecycle Methods
Key methods you can override:
```python
# Data handling
def build_dataloaders(self): ...
def collate_fn(self, batch): ...

# Training/evaluation
def train_step(self, data): ...
def evaluate_step(self, data): ...
def advance(self, loss): ...

# Checkpointing
def save_checkpoint(self, name="latest"): ...
def load_checkpoint(self, checkpoint): ...
```
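As an example of overriding one of these hooks, here is a sketch of a custom `train_step`. The batch layout, `self.criterion`, and the forward/metric bookkeeping are assumptions about a typical setup, not DanLing's exact implementation; `self.model` and `self.metrics` are assumed to have been set up elsewhere on the runner:

```python
class MyRunner(dl.Runner):
    def train_step(self, data):
        input, target = data["input"], data["target"]  # assumed batch layout
        pred = self.model(input)
        loss = self.criterion(pred, target)            # assumed loss attribute
        if self.metrics is not None:
            self.metrics.update(pred, target)
        self.advance(loss)  # backward pass + optimizer/scheduler step
        return loss
```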
## Platform Selection

DanLing's `Runner` automatically selects the appropriate platform using this logic:
1. Check the `platform` config value (`"auto"`, `"torch"`, `"deepspeed"`, or `"accelerate"`)
2. If `"auto"`, select DeepSpeed if available, otherwise use PyTorch
3. Dynamically transform the Runner into the selected platform implementation
You can explicitly select a platform:
```python
config = dl.Config({
    "platform": "deepspeed",  # Force DeepSpeed
    "deepspeed": {
        # DeepSpeed-specific configuration
        "zero_optimization": {"stage": 2},
    },
})
```
| Platform | Best For | Key Features |
|---|---|---|
| TorchRunner | Flexibility, custom extensions | Native PyTorch DDP, most customizable |
| DeepSpeedRunner | Very large models (billions of parameters) | ZeRO optimization, CPU/NVMe offloading |
| AccelerateRunner | Multi-platform compatibility | Simple API, works on CPU/GPU/TPU |
## Experiment Management
DanLing organizes experiments in a hierarchical system:
```text
{project_root}/
├── {experiment_name}-{run_name}-{id}/  # Unique run directory
│   ├── checkpoints/                    # Model checkpoints
│   │   ├── best/                       # Best model
│   │   ├── latest/                     # Most recent model
│   │   └── epoch-{N}/                  # Periodic checkpoints
│   ├── run.log                         # Execution logs
│   ├── runner.yaml                     # Runner configuration
│   ├── results.json                    # Complete results
│   ├── latest.json                     # Latest results
│   └── best.json                       # Best results
└── ...
```
### Identifiers

DanLing provides both human-friendly and unique identifiers:

- `experiment_name`: Human-readable name for the experiment
- `run_name`: Human-readable name for the specific run
- `experiment_id` and `run_id`: Automatically generated unique IDs
- `id`: Combined unique identifier
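A hedged sketch of how these fit together; the attribute names follow the list above, while the specific values and the assumption that config names flow through to the runner are illustrative:

```python
config = dl.Config()
config.experiment_name = "resnet50_baseline"  # human-readable, set by you
config.run_name = "lr-1e-3"

runner = dl.Runner(config)
print(runner.experiment_name)  # "resnet50_baseline"
print(runner.run_id)           # unique ID, generated automatically
print(runner.id)               # combined unique identifier
```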
### Checkpointing
Save and restore checkpoints:
```python
# Automatic checkpointing (controlled by config)
config.save_interval = 5               # Save every 5 epochs
config.checkpoint_dir = "checkpoints"  # Custom checkpoint directory

# Manual checkpointing
runner.save_checkpoint(name="custom_checkpoint")
runner.load_checkpoint("path/to/checkpoint")

# Resuming from a previous run
config.checkpoint = "path/to/previous/checkpoint"
runner = MyRunner(config)  # Will resume from the checkpoint
```
## Production and MLOps

### Reproducibility
Ensure reproducible results:
```python
config.seed = 42             # Set random seed
config.deterministic = True  # Force deterministic algorithms
```
### Logging and Visualization
Configure logging and TensorBoard:
```python
config.log = True          # Enable file logging
config.tensorboard = True  # Enable TensorBoard
config.log_interval = 100  # Log every 100 iterations
```
### Distributed Training
Configure multi-GPU and multi-node training:
```python
# Single-node, multi-GPU
config.platform = "torch"  # or "deepspeed" or "accelerate"
# Environment variables will be used: WORLD_SIZE, RANK, LOCAL_RANK

# Manual configuration
config.world_size = 4  # Number of processes
config.rank = 0        # Global rank
config.local_rank = 0  # Local rank
```
## Examples
DanLing includes several example implementations:
### MNIST with PyTorch

Complete image classification example using TorchRunner.

### IMDB Text Classification with Accelerate

Text classification using HuggingFace Transformers and Accelerate.
For more examples, see the demo directory.