# Tensor

The `danling.tensor` module provides utilities for handling tensors with variable lengths in batched operations. Its core feature is the `NestedTensor` class, which allows efficient representation of sequences of different lengths without excessive padding.
## Overview
In many deep learning tasks, especially those involving sequences (text, time series, etc.), each example in a batch may have a different length. Traditional approaches include:
- Padding: Adding placeholder values to make all examples the same length (wastes computation)
- Bucketing: Grouping similar-length examples (complicates training)
- Processing one sample at a time: Slow and inefficient
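To make the trade-off concrete, here is a minimal sketch of the manual padding approach in plain PyTorch (using only `torch.nn.utils.rnn.pad_sequence`; the data is illustrative):

```python
import torch
from torch.nn.utils.rnn import pad_sequence

sequences = [torch.tensor([1, 2, 3]), torch.tensor([4, 5])]

# Pad every sequence to the length of the longest one; the padded
# positions are wasted computation downstream
padded = pad_sequence(sequences, batch_first=True, padding_value=0)  # shape (2, 3)

# A boolean mask marking valid positions has to be built and carried
# around by hand
lengths = torch.tensor([len(s) for s in sequences])
mask = torch.arange(padded.size(1)) < lengths.unsqueeze(-1)
```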
The `NestedTensor` class solves these problems by providing:
- A way to store variable-length tensors in a single object
- Automatic padding and mask generation for efficient computation
- Transparent access to the original tensors or padded representations
- PyTorch-like operations on nested structures
## Key Components

The module consists of several key components:

- `NestedTensor`: The main class for handling variable-length tensors in a batch.
- `PNTensor`: A tensor wrapper that can be automatically converted to a `NestedTensor` by the PyTorch `DataLoader`.
- `tensor()`: A function to create a `PNTensor` object (similar to `torch.tensor()`).
- `TorchFuncRegistry`: A registry for extending PyTorch functions to work with `NestedTensor`.
- `functional`: Helper functions for padding, masking, and tensor manipulation.
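The `tensor()` helper listed above can be illustrated with a short sketch (its behavior is assumed from the description; only the call shown here is implied by the docs):

```python
from danling.tensor import tensor

# tensor() mirrors torch.tensor() but returns a PNTensor, so values
# produced this way can later be collated into a NestedTensor
sample = tensor([1.0, 2.0, 3.0])
print(type(sample).__name__)  # expected: PNTensor
```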
## Quick Start

### Creating a NestedTensor
```python
import torch
from danling.tensor import NestedTensor

# Create from a list of tensors with different lengths
tensor1 = torch.tensor([1, 2, 3])
tensor2 = torch.tensor([4, 5])
nested = NestedTensor(tensor1, tensor2)

# Access properties
print(nested.tensor)  # Padded tensor: [[1, 2, 3], [4, 5, 0]]
print(nested.mask)    # Mask: [[True, True, True], [True, True, False]]
print(nested.concat)  # Concatenated: [1, 2, 3, 4, 5]

# Index operations
print(nested[0])      # First tensor: [1, 2, 3]
print(nested[:, 1:])  # Slice: NestedTensor([[2, 3], [5, 0]])
```
### Creating from Non-Tensor Data
```python
from danling.tensor import NestedTensor

# Create directly from lists
nested = NestedTensor([1, 2, 3], [4, 5])
print(nested.tolist())  # [[1, 2, 3], [4, 5]]

# Create with different element types
nested = NestedTensor(["hello", "world"], ["this", "is", "a", "test"])
```
## Working with NestedTensor

### Operations

`NestedTensor` supports many PyTorch operations:
```python
# Arithmetic operations
result = nested + 10
result = nested * 2

# Type conversion
float_nested = nested.float()
half_nested = nested.half()

# Device movement
gpu_nested = nested.cuda()
cpu_nested = gpu_nested.cpu()

# Shape operations
print(nested.shape)    # torch.Size([2, 3])
print(nested.size(0))  # 2
```
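Because the padded representation contains filler values, reductions such as a per-sequence mean should be restricted to valid positions using the mask. A minimal sketch (this is not a built-in `danling` helper):

```python
# Per-sequence mean that ignores padded positions
lengths = nested.mask.sum(dim=-1)                       # valid elements per sequence
masked_sum = (nested.tensor * nested.mask).sum(dim=-1)  # padded positions contribute zero
means = masked_sum / lengths
```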
### Unpacking

You can easily convert back to the original tensors:
```python
# Get as a list of lists
data = nested.tolist()

# Get as a tuple of (padded_tensor, mask)
tensor, mask = nested[:]

# Access individual items
first_item = nested[0]  # Returns the first tensor
```
## Integration with PyTorch DataLoader

The `PNTensor` class makes it easy to use `NestedTensor` with PyTorch's `DataLoader`:
```python
from torch.utils.data import Dataset, DataLoader
from danling.tensor import PNTensor

class VariableLengthDataset(Dataset):
    def __init__(self, data):
        self.data = data

    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx):
        # Return a PNTensor, which will be automatically
        # collated into a NestedTensor
        return PNTensor(self.data[idx])

# Example usage
dataset = VariableLengthDataset([
    [1, 2, 3],
    [4, 5],
    [6, 7, 8, 9],
])
dataloader = DataLoader(dataset, batch_size=3)

# The batches will be NestedTensor objects
for batch in dataloader:
    print(type(batch))   # <class 'danling.tensor.nested_tensor.NestedTensor'>
    print(batch.tensor)  # Padded tensor
    print(batch.mask)    # Mask
```
## Advanced Usage

### Custom Collation

If you need more control over collation:
```python
from torch.utils.data import DataLoader
from danling.tensor import NestedTensor

def custom_collate_fn(batch):
    return NestedTensor(*batch)

dataloader = DataLoader(
    dataset,
    batch_size=32,
    collate_fn=custom_collate_fn,
)
```
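Custom collation also helps when each sample mixes variable-length and fixed-size fields. A sketch assuming each sample is a `(sequence, label)` pair of tensors (the field layout is illustrative):

```python
import torch
from torch.utils.data import DataLoader
from danling.tensor import NestedTensor

def pair_collate_fn(batch):
    sequences, labels = zip(*batch)
    # Variable-length sequences become a NestedTensor,
    # fixed-size labels are stacked into a regular tensor
    return NestedTensor(*sequences), torch.stack(labels)

dataloader = DataLoader(dataset, batch_size=32, collate_fn=pair_collate_fn)
```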
### Working with Masked Models

`NestedTensor` works well with models that support attention masks:
```python
# For transformer models that accept an attention mask
outputs = model(
    input_ids=nested_inputs.tensor,
    attention_mask=nested_inputs.mask,
)
```
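For PyTorch's built-in transformer layers the convention is inverted: `src_key_padding_mask` expects `True` at padded positions, so the mask needs to be negated. A sketch with illustrative sizes (the embedding and encoder here are assumptions, not part of `danling`):

```python
import torch
from torch import nn

embed = nn.Embedding(num_embeddings=1000, embedding_dim=64)  # illustrative vocabulary and width
encoder_layer = nn.TransformerEncoderLayer(d_model=64, nhead=4, batch_first=True)
encoder = nn.TransformerEncoder(encoder_layer, num_layers=2)

hidden = embed(nested_inputs.tensor)  # (batch, max_len, 64)
outputs = encoder(
    hidden,
    src_key_padding_mask=~nested_inputs.mask,  # True marks padding for PyTorch
)
```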
### Extending PyTorch Functions

You can extend PyTorch functions to work with `NestedTensor`:
```python
import torch
from danling.tensor.nested_tensor import NestedTensorFunc

@NestedTensorFunc.implement(torch.softmax)
def softmax(tensor, dim=-1):
    # Implement softmax for NestedTensor
    return tensor.nested_like(torch.softmax(tensor.tensor, dim=dim))
```
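Once registered, the standard PyTorch call should dispatch to the implementation above when given a `NestedTensor` (a quick sketch, assuming the registration succeeded):

```python
from danling.tensor import NestedTensor

nested = NestedTensor(torch.tensor([1.0, 2.0, 3.0]), torch.tensor([4.0, 5.0]))
probs = torch.softmax(nested, dim=-1)  # routed to the softmax defined above
```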