DanLing Runner¶
The Runner module provides a unified interface for the complete deep learning model lifecycle, supporting training, evaluation, inference, and experiment management across multiple distributed computing platforms.
Core Concepts¶
DanLing uses a two-level architecture, pairing Runner with Config, to provide a flexible, extensible framework:
Config¶
Config is a specialized dictionary that stores all serializable state, including:
- Hyperparameters (learning rate, batch size, etc.)
- Model configuration
- Dataset settings
Config extends chanfig.Config, so it is hierarchical and its entries can be accessed as attributes:
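For example (a minimal sketch; the nested keys shown here are illustrative):

```python
from danling.runner import Config

config = Config()
config.network.type = "resnet50"  # nested entries are created on the fly
config.optim.lr = 1e-3

# attribute access and item access are interchangeable
assert config.optim.lr == config["optim"]["lr"]
```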
Configurations can be serialized to YAML/JSON and loaded from the command line:
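A sketch of the round trip (chanfig provides the save/parse machinery; the file name and the dotted override flag are illustrative):

```python
from danling.runner import Config

config = Config()
config.optim.lr = 1e-3
config.save("config.yaml")  # JSON works too, selected by file extension

# later, from the command line:
#   python train.py --config config.yaml --optim.lr 3e-4
config = Config().parse()   # loads config.yaml, then applies dotted overrides
```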
By default, Config loads the config file specified via --config config.yaml.
Runner¶
Runner is the central class that manages the entire model lifecycle. Key components include:
- State Management: Training/evaluation state and progress tracking
- Model Handling: Loading/saving model checkpoints
- Operations: Training, evaluation, and inference loops
- Metrics: Tracking and logging performance metrics
- Distributed Execution: Managing multi-device/multi-node execution
Available Platforms¶
DanLing supports multiple distributed computing platforms:
- TorchRunner: Native PyTorch DistributedDataParallel (DDP) implementation
- DeepSpeedRunner: Microsoft DeepSpeed integration for large models
- AccelerateRunner: HuggingFace Accelerate for simplified multi-platform execution
The base Runner class automatically selects the appropriate platform based on your configuration and available packages.
Customizing Your Workflow¶
Custom Configuration¶
Create a configuration class for better IDE support and documentation:
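One way to do this is to subclass Config and set defaults in __init__ (a sketch; the field names are illustrative):

```python
from danling.runner import Config

class MyConfig(Config):
    """Defaults in one place give IDE completion and self-documenting options."""

    def __init__(self):
        super().__init__()
        self.epochs = 10
        self.network.type = "resnet50"
        self.optim.lr = 1e-3
        self.dataloader.batch_size = 32
```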
Optimizers and Schedulers¶
DanLing supports all PyTorch optimizers plus DeepSpeed optimizers when available:
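For instance, a standard PyTorch optimizer and scheduler pair can be built like this (a sketch; DeepSpeed's fused optimizers can be substituted when deepspeed is installed):

```python
import torch

def build_optim(model: torch.nn.Module, lr: float, epochs: int):
    # AdamW ships with PyTorch; deepspeed.ops.adam.FusedAdam is a common
    # drop-in replacement when DeepSpeed is available
    optimizer = torch.optim.AdamW(model.parameters(), lr=lr)
    scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=epochs)
    return optimizer, scheduler

optimizer, scheduler = build_optim(torch.nn.Linear(4, 2), lr=1e-3, epochs=10)
```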
Mixed Precision and Performance Optimization¶
Enable mixed precision training for faster execution:
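A configuration fragment for this might look as follows (the exact key names are assumptions; consult the Config reference):

```python
from danling.runner import Config

config = Config()
config.precision = "fp16"  # or "bf16" on hardware that supports it
config.accum_steps = 4     # gradient accumulation to trade memory for batch size
```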
Custom Metrics¶
Register custom metrics or use built-in ones:
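A custom metric is just a function of predictions and targets; how it is registered on the runner depends on the Runner API, so only the metric itself is sketched here:

```python
import torch

def accuracy(logits: torch.Tensor, targets: torch.Tensor) -> float:
    """Fraction of samples whose top-1 prediction matches the target."""
    return (logits.argmax(dim=-1) == targets).float().mean().item()

logits = torch.tensor([[2.0, 0.1], [0.2, 1.5]])
targets = torch.tensor([0, 1])
print(accuracy(logits, targets))  # → 1.0
```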
Extending DanLing¶
DanLing is designed to be extensible at multiple levels:
Extension Pattern 1: Customize the Runner¶
Extend the Runner class (not TorchRunner directly) to preserve platform selection:
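A sketch of the pattern (the setup performed in __init__ is illustrative):

```python
from danling.runner import Config, Runner

class MyRunner(Runner):
    """Subclassing Runner keeps automatic torch/deepspeed/accelerate selection."""

    def __init__(self, config: Config):
        super().__init__(config)
        # build models, criteria, and dataloaders here
```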
Extension Pattern 2: Custom Distributed Framework¶
Extend TorchRunner only when implementing a new distributed training framework:
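A sketch of this second pattern (the hook name is an assumption; check the TorchRunner reference for the actual extension points):

```python
from danling.runner import TorchRunner

class MyBackendRunner(TorchRunner):
    """Subclass TorchRunner only to implement a new distributed backend."""

    def init_distributed(self) -> None:
        # illustrative hook: set up process groups and communication
        # primitives for the new framework here
        ...
```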
Lifecycle Methods¶
Key methods you can override:
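Typical overrides look like this (hook names sketched from common Runner lifecycles; check the API reference for exact signatures):

```python
from danling.runner import Runner

class MyRunner(Runner):
    def train_step(self, data):
        # one forward/backward/update pass over a batch
        ...

    def evaluate_step(self, data):
        # one forward pass over a batch, no gradient updates
        ...
```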
Platform Selection¶
DanLing’s Runner automatically selects the appropriate platform using this logic:
- Check the platform config value ("auto", "torch", "deepspeed", or "accelerate")
- If "auto", select DeepSpeed if available, otherwise use PyTorch
- Dynamically transform the Runner into the selected platform implementation
You can explicitly select a platform:
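For example (a configuration fragment):

```python
from danling.runner import Config

config = Config()
config.platform = "accelerate"  # "auto", "torch", "deepspeed", or "accelerate"
```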
Platform Comparison¶
| Platform | Best For | Key Features |
|---|---|---|
| TorchRunner | Flexibility, custom extensions | Native PyTorch DDP, most customizable |
| DeepSpeedRunner | Very large models (billions of parameters) | ZeRO optimization, CPU/NVMe offloading |
| AccelerateRunner | Multi-platform compatibility | Simple API, works on CPU/GPU/TPU |
Experiment Management¶
DanLing organizes experiments in a hierarchical system:
Identifiers¶
DanLing provides both human-friendly and unique identifiers:
- experiment_name: Human-readable name for the experiment
- run_name: Human-readable name for the specific run
- experiment_id and run_id: Automatically generated unique IDs
- id: Combined unique identifier
Checkpointing¶
Save and restore checkpoints:
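A sketch of what this involves (the key and method names below are assumptions; consult the Runner and Config references):

```python
from danling.runner import Config

config = Config()
config.checkpoint_dir = "checkpoints"  # illustrative key: where checkpoints are written
config.save_interval = 1               # illustrative key: save every N epochs

# at runtime (method names are assumptions):
#   runner.save_checkpoint()
#   runner.load_checkpoint("checkpoints/latest")
```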
Production and MLOps¶
Reproducibility¶
Ensure reproducible results:
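A configuration fragment for this might look as follows (key names are assumptions; check the Config reference):

```python
from danling.runner import Config

config = Config()
config.seed = 42             # the Runner seeds Python, NumPy, and PyTorch RNGs
config.deterministic = True  # illustrative key: prefer deterministic kernels
```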
Logging and Visualization¶
Configure logging and TensorBoard:
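A configuration fragment along these lines (key names are assumptions; check the Config reference):

```python
from danling.runner import Config

config = Config()
config.log = True          # write text logs to the run directory
config.tensorboard = True  # write TensorBoard event files
config.log_interval = 100  # illustrative key: log every N steps
```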
Distributed Training¶
Configure multi-GPU and multi-node training:
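A minimal sketch: select the native PyTorch platform, then launch one process per GPU with the standard PyTorch launcher (the script and config file names are illustrative):

```python
from danling.runner import Config

config = Config()
config.platform = "torch"

# launch from the shell, e.g.:
#   torchrun --nproc_per_node=8 train.py --config config.yaml
# multi-node runs add --nnodes/--node_rank/--rdzv_endpoint as usual
```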
Examples¶
DanLing includes several example implementations:
MNIST with PyTorch¶
Complete image classification example using TorchRunner:
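A condensed sketch of such a runner (the full example in the demo directory also builds dataloaders and metrics; the attribute names here are illustrative):

```python
import torch
from danling.runner import Config, TorchRunner

class MNISTRunner(TorchRunner):
    def __init__(self, config: Config):
        super().__init__(config)
        self.model = torch.nn.Sequential(
            torch.nn.Flatten(),
            torch.nn.Linear(28 * 28, 10),  # 10 digit classes
        )
        self.criterion = torch.nn.CrossEntropyLoss()
        self.optimizer = torch.optim.SGD(self.model.parameters(), lr=config.optim.lr)
        # MNIST train/val dataloaders would be attached here

if __name__ == "__main__":
    config = Config()
    config.optim.lr = 1e-2
    config.epochs = 2
    MNISTRunner(config).train()
```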
IMDB Text Classification with Accelerate¶
Text classification using HuggingFace Transformers and Accelerate:
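A condensed sketch of the Accelerate variant (the config.pretrained key is illustrative; the full example in the demo directory also tokenizes and batches the IMDB data):

```python
from danling.runner import AccelerateRunner, Config
from transformers import AutoModelForSequenceClassification, AutoTokenizer

class IMDBRunner(AccelerateRunner):
    def __init__(self, config: Config):
        super().__init__(config)
        self.tokenizer = AutoTokenizer.from_pretrained(config.pretrained)
        self.model = AutoModelForSequenceClassification.from_pretrained(
            config.pretrained, num_labels=2  # positive / negative reviews
        )
        # tokenized IMDB dataloaders would be attached here
```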
For more examples, see the demo directory.