
Transfer Learning, Fine-Tuning, and Domain Adaptation: A Complete Guide with Anomaly Detection for Heterogeneous Cobots

A Universal Robots UR5e and a FANUC CRX-10iA sit on the same production line, performing identical pick-and-place operations. Both have six joints, both lift the same payload, and both generate streams of torque, position, and velocity data every millisecond. Yet when you train an anomaly detection model on the UR5e’s data and deploy it on the FANUC, the model flags nearly everything as anomalous. The sensor noise profiles are different. The control loop frequencies don’t match. The calibration offsets create entirely different data distributions. You have a model that understands what “normal” looks like for one robot but is completely blind to normalcy on another.

This is not a toy problem. As collaborative robots (cobots) proliferate across manufacturing, logistics, and healthcare, companies increasingly operate heterogeneous fleets — multiple brands, multiple generations, multiple firmware versions. Training a separate anomaly detection model for every brand is expensive, slow, and wasteful. What if the model could transfer its understanding of normal robot behavior across brands?

That is precisely what transfer learning, fine-tuning, and domain adaptation were built to solve. In this guide, we will dissect these three concepts — clarifying exactly how they relate to each other — and then apply them to a real-world scenario: building a cross-brand anomaly detection system for heterogeneous cobots. By the end, you will have not just theoretical understanding but complete, runnable PyTorch code for multiple adaptation strategies.

Key Takeaway: Transfer learning is the umbrella paradigm. Fine-tuning and domain adaptation are specific techniques within it. Understanding this hierarchy is essential before diving into implementation.

Before we go further, let’s establish the conceptual hierarchy that will frame this entire discussion:

Transfer Learning (broad paradigm)
├── Fine-Tuning (retrain pre-trained model on new data)
├── Domain Adaptation (bridge distribution gap between domains)
│   ├── Supervised Domain Adaptation
│   ├── Unsupervised Domain Adaptation (UDA)
│   └── Semi-Supervised Domain Adaptation
├── Feature Extraction (freeze pre-trained layers, train new head)
├── Multi-Task Learning (shared representations)
└── Zero-Shot / Few-Shot Transfer

Transfer learning is the big idea: take knowledge learned in one context and apply it in another. Fine-tuning is one way to do it — you take a pre-trained model and continue training it on your target data. Domain adaptation is another way — you specifically address the fact that your source and target data come from different distributions. Feature extraction, multi-task learning, and zero/few-shot transfer are additional strategies under the same umbrella. They are all siblings, not synonyms.

With that map in hand, let’s explore each territory in depth.

Transfer Learning — The Big Picture

Formal Definition

Transfer learning is the paradigm of leveraging knowledge acquired from a source task or domain to improve learning on a target task or domain. Formally, given a source domain DS with a learning task TS, and a target domain DT with a learning task TT, transfer learning aims to improve the learning of the target predictive function fT(·) using knowledge from DS and TS, where DS ≠ DT or TS ≠ TT.

In plain English: you’ve already spent resources learning something useful somewhere. Now you want to reuse that learning instead of starting from zero.

Why Transfer Learning Matters

The motivation is overwhelmingly practical:

  • Limited labeled data: Labeling anomalies in cobot sensor data requires domain experts who understand both the robot’s kinematics and the manufacturing process. You might have thousands of labeled samples for one robot brand but almost none for another.
  • Expensive annotation: Each labeled anomaly might require a robotics engineer to review hours of sensor logs. At $150/hour, labeling 10,000 samples across five brands could cost more than the robots themselves.
  • Faster convergence: A model initialized with transferred knowledge reaches acceptable performance in hours rather than weeks.
  • Better generalization: Features learned from large, diverse datasets often capture universal patterns that improve performance even on seemingly unrelated tasks.

Types of Transfer Learning

The taxonomy breaks down based on what differs between source and target:

| Type | Source Labels | Target Labels | Relationship | Example |
| --- | --- | --- | --- | --- |
| Inductive Transfer | Available | Available | TS ≠ TT | ImageNet classification → medical image segmentation |
| Transductive Transfer | Available | Not available | DS ≠ DT, TS = TT | UR5e anomaly detection → FANUC anomaly detection (no FANUC labels) |
| Unsupervised Transfer | Not available | Not available | DS ≠ DT | Self-supervised pre-training on all cobot data → clustering |

For our cobot scenario, transductive transfer is the most relevant: we have labeled anomaly data from one or a few brands (source domains) and want to perform the same anomaly detection task on new brands (target domains) where labels are scarce or nonexistent.

When Transfer Learning Works — and When It Fails

Transfer learning is not magic. It works when the source and target share some underlying structure. A model trained on ImageNet transfers well to medical imaging because both involve recognizing edges, textures, and shapes. A model trained on English text transfers well to French because both languages share grammatical abstractions.

It fails — sometimes catastrophically — when the source and target are too dissimilar. This is called negative transfer: the transferred knowledge actively hurts performance on the target task. For example, a model trained on satellite imagery might transfer poorly to microscopy images despite both being “images.” The spatial scales, textures, and semantic meanings are fundamentally different.

Caution: Negative transfer is insidious because it can look like a model training problem. If your transferred model performs worse than a randomly initialized model, suspect negative transfer. The fix is usually to reduce the amount of knowledge transferred (freeze fewer layers) or reconsider whether transfer is appropriate at all.

In our cobot scenario, transfer learning is highly promising because the robots share the same fundamental kinematic structure. A 6-axis articulated arm generates torque profiles that follow similar physical laws regardless of brand. The differences are in sensor calibration, noise characteristics, and control system idiosyncrasies — exactly the kind of distribution shift that domain adaptation was designed to handle.

Historical Context

Transfer learning’s modern era began with the ImageNet revolution. In 2012, AlexNet showed that deep CNNs could learn powerful visual features. By 2014, researchers discovered that these features — especially from early layers — transferred remarkably well to other vision tasks. “ImageNet pre-training” became the default starting point for almost any computer vision project.

NLP followed a similar trajectory. Word2Vec and GloVe provided transferable word embeddings, but the real revolution came with BERT (2018) and GPT (2018-2019), which showed that pre-training on massive text corpora created representations that transferred to virtually any language task. Today, large language models are perhaps the ultimate transfer learning systems — pre-trained on trillions of tokens, then fine-tuned or prompted for specific tasks.

The time-series and industrial AI domains are now experiencing their own transfer learning moment. Models like Chronos, TimesFM, and Lag-Llama are emerging as foundation models for temporal data, and domain adaptation for sensor data is an active area of research with direct industrial applications.

Training From Scratch vs. Transfer Learning

| Factor | From Scratch | Transfer Learning |
| --- | --- | --- |
| Labeled data needed | Large (10k–1M+ samples) | Small (100–1k samples) |
| Training time | Days to weeks | Hours to days |
| Compute cost | High (multi-GPU) | Low to moderate (single GPU) |
| Performance (limited data) | Poor (overfits) | Good to excellent |
| Performance (abundant data) | Excellent (eventually) | Excellent (faster) |
| Domain expertise needed | High (architecture design) | Moderate (strategy selection) |
| Risk of negative transfer | None | Possible if domains too different |

Fine-Tuning — Techniques and Strategies

Fine-tuning is the most widely used transfer learning technique: take a model pre-trained on a source task/domain, and continue training it on your target data. Simple in concept, nuanced in practice.

Full Fine-Tuning vs. Partial Fine-Tuning

Full fine-tuning updates all parameters of the pre-trained model. This gives the model maximum flexibility to adapt but also the highest risk of overfitting — especially when the target dataset is small. If you have 50,000 labeled samples in your target domain, full fine-tuning is usually safe. If you have 500, it’s dangerous.

Partial fine-tuning freezes some layers (typically earlier ones) and only updates the rest. The intuition is that early layers learn generic, transferable features (edge detectors in vision, basic temporal patterns in time-series), while later layers learn task-specific features. By freezing early layers, you preserve the generic knowledge and only adapt the task-specific parts.
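A minimal sketch of partial fine-tuning in PyTorch shows how little code the freeze/unfreeze decision takes (toy layer sizes for illustration, not the cobot architecture from this guide):

```python
import torch
import torch.nn as nn

# Toy model: two conv blocks (generic features) plus a linear head
model = nn.Sequential(
    nn.Conv1d(24, 64, kernel_size=7, padding=3),
    nn.ReLU(),
    nn.Conv1d(64, 128, kernel_size=5, padding=2),
    nn.ReLU(),
    nn.AdaptiveAvgPool1d(1),
    nn.Flatten(),
    nn.Linear(128, 2),
)

# Partial fine-tuning: freeze the first conv block, train the rest
for layer in model[:2]:
    for param in layer.parameters():
        param.requires_grad = False

# Only pass trainable parameters to the optimizer
trainable = [p for p in model.parameters() if p.requires_grad]
optimizer = torch.optim.Adam(trainable, lr=1e-4)
```

Filtering on `requires_grad` when building the optimizer also avoids wasting optimizer state (momentum buffers, etc.) on frozen weights.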

Layer-Wise Learning Rate Decay (Discriminative Fine-Tuning)

Rather than the binary freeze/unfreeze decision, discriminative fine-tuning assigns different learning rates to different layers. Earlier layers get smaller learning rates (they should change slowly), while later layers get larger learning rates (they need more adaptation). A common approach is to multiply the learning rate by a decay factor for each layer moving backwards from the output:

# Discriminative learning rates in PyTorch
import torch


def get_discriminative_params(model, base_lr=1e-3, decay_factor=0.9):
    """Assign decreasing learning rates to earlier layers."""
    params = []
    layers = list(model.named_parameters())
    n_layers = len(layers)

    for i, (name, param) in enumerate(layers):
        # Earlier layers get smaller LR
        layer_lr = base_lr * (decay_factor ** (n_layers - i - 1))
        params.append({
            'params': param,
            'lr': layer_lr,
            'name': name
        })

    return params

# Usage
param_groups = get_discriminative_params(model, base_lr=1e-3, decay_factor=0.85)
optimizer = torch.optim.AdamW(param_groups)

Gradual Unfreezing

Gradual unfreezing starts by training only the final layer(s), then progressively unfreezes earlier layers as training proceeds. This prevents early layers from being corrupted by the large gradients that occur at the start of fine-tuning when the loss is high. The strategy was popularized by ULMFiT (Universal Language Model Fine-tuning) and works well for both NLP and time-series tasks.
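A sketch of such a schedule (an illustrative helper, assuming the model is organized as a list of blocks ordered from input to output):

```python
import torch
import torch.nn as nn

def gradual_unfreeze(blocks, epoch, unfreeze_every=2):
    """Unfreeze one more block (from the output backwards) every
    `unfreeze_every` epochs; earlier blocks stay frozen longer."""
    n_unfrozen = min(len(blocks), 1 + epoch // unfreeze_every)
    for i, block in enumerate(blocks):
        # Blocks near the output (end of the list) unfreeze first
        trainable = i >= len(blocks) - n_unfrozen
        for p in block.parameters():
            p.requires_grad = trainable

# Example with three blocks
blocks = [nn.Linear(8, 8), nn.Linear(8, 8), nn.Linear(8, 2)]
gradual_unfreeze(blocks, epoch=0)  # only the final block trains at first
```

Calling the helper at the start of each epoch turns the unfreezing schedule into a single line of the training loop.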

The Fine-Tuning Decision Matrix

The right fine-tuning strategy depends on two factors: how much target data you have, and how similar the source and target domains are.

| Scenario | Target Data Size | Domain Similarity | Recommended Strategy |
| --- | --- | --- | --- |
| A | Small (<1k) | High | Feature extraction only (freeze all, train classifier head) |
| B | Small (<1k) | Low | Fine-tune final layers with aggressive regularization |
| C | Large (>10k) | High | Full fine-tuning with small learning rate |
| D | Large (>10k) | Low | Full fine-tuning or train from scratch |

For cobots of the same kinematic structure but different brands, we are firmly in the high domain similarity column. If we have limited labeled data for the target brand (common), Scenario A applies — feature extraction or minimal fine-tuning. If we have substantial data, Scenario C applies — gentle full fine-tuning.

Regularization During Fine-Tuning

Fine-tuning on small datasets risks catastrophic forgetting — the model forgets what it learned during pre-training. Several regularization techniques help:

  • L2-SP (L2 penalty Starting Point): Instead of penalizing weights toward zero, penalize them toward their pre-trained values. This keeps the model close to the pre-trained solution while allowing adaptation.
  • Dropout: Especially effective when added to fine-tuning layers. Typical values: 0.1–0.3 during fine-tuning vs. 0.5 during training from scratch.
  • Early stopping: Monitor validation loss on the target domain and stop when it starts increasing. With small target datasets, overfitting can happen in just a few epochs.
  • Weight decay: Standard L2 regularization remains effective, typically at 0.01–0.1 during fine-tuning.
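Of these, L2-SP is the one least likely to be built into your framework. A minimal sketch of the penalty, with a stand-in linear model and an illustrative `alpha` weighting:

```python
import copy
import torch
import torch.nn as nn

model = nn.Linear(10, 2)  # stand-in for a pre-trained model
pretrained_state = copy.deepcopy(model.state_dict())  # snapshot the starting point

def l2_sp_penalty(model, pretrained_state, alpha=0.01):
    """L2-SP: penalize distance from the pre-trained weights,
    not from zero, so fine-tuning stays near the starting point."""
    penalty = 0.0
    for name, param in model.named_parameters():
        penalty = penalty + (param - pretrained_state[name]).pow(2).sum()
    return alpha * penalty

# During fine-tuning:
#   loss = task_loss + l2_sp_penalty(model, pretrained_state)
```

The `deepcopy` matters: `state_dict()` returns references to the live tensors, so without it the "starting point" would drift along with the model.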

Modern Parameter-Efficient Fine-Tuning

Full fine-tuning updates millions or billions of parameters, which is computationally expensive and requires storing a full copy of the model per task. Parameter-efficient fine-tuning (PEFT) methods address this by updating only a small subset of parameters:

  • LoRA (Low-Rank Adaptation): Injects low-rank matrices into each layer. Instead of updating a weight matrix W directly, LoRA decomposes the update as ΔW = BA where B and A are low-rank matrices. This reduces trainable parameters by 10,000x while maintaining performance.
  • QLoRA: Combines LoRA with 4-bit quantization of the base model, enabling fine-tuning of large models on a single consumer GPU.
  • Adapters: Small bottleneck modules inserted between existing layers. Only adapter parameters are trained; the rest stays frozen.
  • Prefix Tuning / Prompt Tuning: Prepend learnable vectors to the input or hidden states. Primarily used in NLP but conceptually applicable to any sequence model.

Tip: For the cobot scenario, LoRA is particularly attractive. You can maintain one base anomaly detection model and keep tiny per-brand LoRA adapters (a few MB each). Switching between brands is just swapping the adapter weights.
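To make LoRA concrete, here is a minimal sketch of a LoRA-style linear layer, illustrating the ΔW = BA idea (a hand-rolled example, not the API of any particular PEFT library):

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen base linear layer plus a trainable low-rank update:
    y = base(x) + x (BA)^T * (alpha / r)."""
    def __init__(self, base: nn.Linear, r=4, alpha=8):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False  # base weights stay frozen
        self.A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, r))  # zero init: no change at start
        self.scale = alpha / r

    def forward(self, x):
        return self.base(x) + (x @ self.A.T @ self.B.T) * self.scale

base = nn.Linear(128, 64)
lora = LoRALinear(base, r=4)
n_trainable = sum(p.numel() for p in lora.parameters() if p.requires_grad)
# Trainable: r*(in+out) = 4*(128+64) = 768 vs 128*64+64 = 8256 in the base layer
```

Because `B` is zero-initialized, the adapted layer is exactly the pre-trained layer at the start of training, and per-brand adapters are just the small `A` and `B` tensors.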

Fine-Tuning Code Example

Here is a complete example of fine-tuning a PyTorch model with layer freezing and discriminative learning rates for a time-series anomaly detection task:

import torch
import torch.nn as nn


class CobotAnomalyModel(nn.Module):
    """1D-CNN feature extractor + classifier for cobot anomaly detection."""

    def __init__(self, n_joints=6, n_features_per_joint=4, seq_len=200):
        super().__init__()
        in_channels = n_joints * n_features_per_joint  # 24 input channels

        # Feature extractor (transferable layers)
        self.features = nn.Sequential(
            nn.Conv1d(in_channels, 64, kernel_size=7, padding=3),
            nn.BatchNorm1d(64),
            nn.ReLU(),
            nn.Conv1d(64, 128, kernel_size=5, padding=2),
            nn.BatchNorm1d(128),
            nn.ReLU(),
            nn.AdaptiveAvgPool1d(1)
        )

        # Classifier head (task-specific)
        self.classifier = nn.Sequential(
            nn.Linear(128, 64),
            nn.ReLU(),
            nn.Dropout(0.3),
            nn.Linear(64, 2)  # normal vs anomaly
        )

    def forward(self, x):
        # x shape: (batch, channels, seq_len)
        feat = self.features(x).squeeze(-1)
        return self.classifier(feat)


def fine_tune_for_new_brand(
    pretrained_model,
    target_loader,
    val_loader,
    freeze_features=True,
    base_lr=1e-3,
    n_epochs=30
):
    """Fine-tune a pre-trained cobot model for a new brand."""
    model = pretrained_model

    if freeze_features:
        # Strategy A: freeze feature extractor, train only classifier
        for param in model.features.parameters():
            param.requires_grad = False
        optimizer = torch.optim.Adam(
            model.classifier.parameters(), lr=base_lr
        )
    else:
        # Strategy C: discriminative learning rates
        param_groups = [
            {'params': model.features.parameters(), 'lr': base_lr * 0.1},
            {'params': model.classifier.parameters(), 'lr': base_lr},
        ]
        optimizer = torch.optim.Adam(param_groups)

    criterion = nn.CrossEntropyLoss()
    best_val_loss = float('inf')
    patience_counter = 0

    for epoch in range(n_epochs):
        model.train()
        if freeze_features:
            # Keep frozen BatchNorm layers in eval mode so their running
            # statistics are not overwritten by target-domain batches
            model.features.eval()
        for batch_x, batch_y in target_loader:
            optimizer.zero_grad()
            output = model(batch_x)
            loss = criterion(output, batch_y)
            loss.backward()
            optimizer.step()

        # Validation and early stopping
        model.eval()
        val_loss = 0
        with torch.no_grad():
            for batch_x, batch_y in val_loader:
                output = model(batch_x)
                val_loss += criterion(output, batch_y).item()

        val_loss /= len(val_loader)
        if val_loss < best_val_loss:
            best_val_loss = val_loss
            patience_counter = 0
            torch.save(model.state_dict(), 'best_model.pt')
        else:
            patience_counter += 1
            if patience_counter >= 5:
                print(f"Early stopping at epoch {epoch}")
                break

    model.load_state_dict(torch.load('best_model.pt'))
    return model

Domain Adaptation — Bridging the Distribution Gap

While fine-tuning assumes you have at least some labeled data in the target domain, domain adaptation tackles a harder problem: what if you have plenty of labeled data in the source domain but no labels at all in the target domain? This is unsupervised domain adaptation (UDA), and it is the most common and challenging scenario in real-world deployments.

Formal Definition

In domain adaptation, the source and target domains share the same task (e.g., anomaly detection) but have different data distributions. Formally: PS(X) ≠ PT(X), but the labeling function is the same. The goal is to learn a model that performs well on the target distribution despite being trained primarily on the source distribution.

Several types of distribution shift can occur:

  • Covariate shift: P(X) changes but P(Y|X) stays the same. The input distributions differ but the relationship between inputs and outputs is preserved. This is the most common scenario for cobots — the sensor data distributions differ across brands, but the definition of “anomaly” remains consistent.
  • Label shift: P(Y) changes but P(X|Y) stays the same. The prior probability of classes changes. For example, one brand might have a 2% anomaly rate while another has 5%.
  • Concept drift: P(Y|X) changes — the same input means different things in different domains. This is rare for same-structure cobots but could occur if different brands define “normal operating range” differently.
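A toy simulation makes covariate shift tangible: the same underlying torque signal, read through two hypothetical "brands" of sensor with different gain, offset, and noise floor, yields very different raw-value distributions while the labeling rule stays shared:

```python
import numpy as np

rng = np.random.default_rng(0)

# Same physical torque signal, read by two "brands" of sensor
true_torque = rng.normal(loc=10.0, scale=1.0, size=5000)

brand_a = true_torque + rng.normal(0.0, 0.1, 5000)               # low noise, no offset
brand_b = 1.05 * true_torque + 0.8 + rng.normal(0.0, 0.5, 5000)  # gain, offset, more noise

# P(X) differs across brands (covariate shift)...
print(brand_a.mean(), brand_b.mean())

# ...but the labeling rule on the underlying signal is shared: P(Y|X) is the same
labels_a = true_torque > 12.0
labels_b = true_torque > 12.0
```

A model fit to `brand_a`'s raw values will treat much of `brand_b`'s perfectly normal data as out-of-distribution, which is exactly the failure mode from the opening example.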

Key Unsupervised Domain Adaptation Methods

Discrepancy-Based Methods

These methods explicitly measure and minimize the distance between source and target feature distributions.

Maximum Mean Discrepancy (MMD) measures the distance between two distributions by comparing their mean embeddings in a reproducing kernel Hilbert space (RKHS). If the mean embeddings are identical, the distributions are identical (for characteristic kernels). In practice, you add an MMD penalty to the training loss that encourages the network to produce similar feature distributions for source and target data.
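A sketch of the biased MMD estimate with a single RBF kernel (the `sigma` bandwidth is a tunable assumption; multi-kernel variants are common in practice):

```python
import torch

def rbf_kernel(x, y, sigma=1.0):
    # Pairwise squared distances -> RBF kernel matrix
    d2 = torch.cdist(x, y).pow(2)
    return torch.exp(-d2 / (2 * sigma ** 2))

def mmd_loss(source_feat, target_feat, sigma=1.0):
    """Biased estimate of squared MMD between two feature batches."""
    k_ss = rbf_kernel(source_feat, source_feat, sigma).mean()
    k_tt = rbf_kernel(target_feat, target_feat, sigma).mean()
    k_st = rbf_kernel(source_feat, target_feat, sigma).mean()
    return k_ss + k_tt - 2 * k_st

# Added to the training objective:
#   total_loss = task_loss + mmd_weight * mmd_loss(f_src, f_tgt)
```

The penalty is zero when the two feature batches are identically distributed and grows as their embeddings drift apart, which is what pushes the network toward domain-invariant features.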

CORAL (CORrelation ALignment) aligns the second-order statistics (covariance matrices) of source and target features. Deep CORAL integrates this alignment into the network by adding a CORAL loss at one or more hidden layers. The CORAL loss is simply the Frobenius norm of the difference between source and target covariance matrices.
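The CORAL loss is short enough to write out directly; a sketch following the Frobenius-norm form described above, with the 1/(4d²) scaling used in the Deep CORAL paper:

```python
import torch

def coral_loss(source_feat, target_feat):
    """Deep CORAL loss: squared Frobenius distance between the
    source and target feature covariance matrices."""
    d = source_feat.size(1)

    def covariance(f):
        f = f - f.mean(dim=0, keepdim=True)
        return (f.T @ f) / (f.size(0) - 1)

    diff = covariance(source_feat) - covariance(target_feat)
    return (diff ** 2).sum() / (4 * d * d)
```

Because it only matches second-order statistics, CORAL is cheap and stable, but it cannot correct shifts that leave the covariances unchanged; that is where adversarial methods earn their extra complexity.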

Adversarial-Based Methods

These methods use an adversarial framework to learn domain-invariant features — features that are useful for the task but that a discriminator cannot use to distinguish between source and target domains.

Domain-Adversarial Neural Networks (DANN) are the flagship approach. The architecture has three components: a shared feature extractor, a task classifier (for anomaly detection), and a domain discriminator. The key innovation is the gradient reversal layer (GRL): during backpropagation, gradients from the domain discriminator are reversed before reaching the feature extractor. This means the feature extractor is trained to maximize the domain discriminator’s loss — i.e., to produce features that confuse the discriminator about which domain the data came from.

ADDA (Adversarial Discriminative Domain Adaptation) uses separate feature extractors for source and target, with the target extractor initialized from the source. The adversarial game is played between the target encoder and the discriminator.

CyCADA (Cycle-Consistent Adversarial Domain Adaptation) combines pixel-level adaptation (using CycleGAN-style image translation) with feature-level adaptation. While primarily used for visual tasks, the concept of cycle-consistent adaptation extends to other modalities.

Self-Training and Pseudo-Labeling

Self-training is a conceptually simple but surprisingly effective approach: train on labeled source data, generate predictions (pseudo-labels) on unlabeled target data, and retrain on the combined dataset. The key challenges are noise in pseudo-labels and confirmation bias. Modern approaches use confidence thresholding (only keep high-confidence pseudo-labels) and curriculum learning (start with the most confident predictions and gradually include less confident ones).
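A minimal confidence-thresholded pseudo-labeling step might look like this (a sketch; it assumes a model that returns class logits):

```python
import torch
import torch.nn as nn

@torch.no_grad()
def pseudo_label(model, target_x, threshold=0.95):
    """Return (inputs, pseudo-labels) for the target samples the model
    is confident about; the rest are discarded for this round."""
    model.eval()
    probs = torch.softmax(model(target_x), dim=1)
    confidence, labels = probs.max(dim=1)
    keep = confidence >= threshold
    return target_x[keep], labels[keep]

# Toy check: an untrained linear model on random data
model = nn.Linear(16, 2)
x = torch.randn(32, 16)
kept_x, kept_y = pseudo_label(model, x, threshold=0.5)
```

Curriculum-style variants simply start with a high threshold and lower it across rounds, admitting less confident pseudo-labels as the model adapts.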

Optimal Transport Methods

Optimal transport provides a mathematically principled way to measure and minimize the distance between distributions using the Wasserstein distance. It finds the minimum “cost” of transforming one distribution into another and can be used to explicitly map source features to target features.
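For intuition, entropy-regularized optimal transport between two equally weighted feature batches can be sketched in plain PyTorch via Sinkhorn iterations (a minimal illustration, not a production OT solver; `epsilon` controls the regularization strength):

```python
import torch

def sinkhorn_distance(x, y, epsilon=0.5, n_iters=200):
    """Entropy-regularized OT cost between two equally weighted
    point clouds, via Sinkhorn iterations on the cost matrix."""
    cost = torch.cdist(x.double(), y.double()).pow(2)  # squared-Euclidean cost
    n, m = cost.shape
    mu = torch.full((n,), 1.0 / n, dtype=torch.float64)  # uniform source weights
    nu = torch.full((m,), 1.0 / m, dtype=torch.float64)  # uniform target weights
    K = torch.exp(-cost / epsilon)                       # Gibbs kernel
    u = torch.ones(n, dtype=torch.float64)
    for _ in range(n_iters):                             # alternating scaling updates
        v = nu / (K.T @ u)
        u = mu / (K @ v)
    plan = u[:, None] * K * v[None, :]                   # transport plan
    return (plan * cost).sum()
```

The resulting cost can be minimized as a training penalty, much like MMD, or the transport plan itself can be used to map source features onto the target distribution.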

Advanced Domain Adaptation Scenarios

The standard UDA setup assumes one source and one target domain. Real-world scenarios are often more complex:

  • Multi-source domain adaptation: You have labeled data from multiple source domains (e.g., three cobot brands) and want to adapt to a new target brand. Methods like MDAN (Multi-source Domain Adversarial Networks) and M3SDA handle this by learning domain-specific and domain-shared features simultaneously.
  • Partial domain adaptation: The target domain has fewer classes than the source. For example, your source model detects 10 types of anomalies, but the target brand only experiences 6 of them. Standard UDA methods can perform poorly because they try to align classes that don’t exist in the target.
  • Open-set domain adaptation: The target domain contains classes not seen in the source. This is realistic for cobots — a new brand might exhibit failure modes not present in the training data. Methods must both adapt known classes and detect unknown target-specific anomalies.

Method Comparison

| Method | Mechanism | Best When | Complexity | Performance |
| --- | --- | --- | --- | --- |
| MMD | Match kernel mean embeddings | Small domain gap, clean data | Low | Good baseline |
| CORAL | Align covariance matrices | Linear shifts between domains | Low | Good for simple shifts |
| DANN | Adversarial domain confusion | Complex nonlinear shifts | Medium | Strong across scenarios |
| Self-Training | Pseudo-label target data | High-confidence predictions available | Low | Variable (depends on pseudo-label quality) |
| Optimal Transport | Wasserstein distance minimization | Strong theoretical guarantees needed | High | Strong but computationally expensive |

DANN Implementation with Gradient Reversal Layer

Here is a complete PyTorch implementation of a Domain-Adversarial Neural Network:

import torch
import torch.nn as nn
from torch.autograd import Function


class GradientReversalFunction(Function):
    """Gradient Reversal Layer (GRL).

    Forward pass: identity function.
    Backward pass: negate gradients and scale by lambda.
    """
    @staticmethod
    def forward(ctx, x, lambda_val):
        ctx.lambda_val = lambda_val
        return x.clone()

    @staticmethod
    def backward(ctx, grad_output):
        return -ctx.lambda_val * grad_output, None


class GradientReversalLayer(nn.Module):
    def __init__(self, lambda_val=1.0):
        super().__init__()
        self.lambda_val = lambda_val

    def forward(self, x):
        return GradientReversalFunction.apply(x, self.lambda_val)


class DANN(nn.Module):
    """Domain-Adversarial Neural Network for time-series data."""

    def __init__(self, n_input_channels=24, n_classes=2, n_domains=2):
        super().__init__()

        # Shared feature extractor
        self.feature_extractor = nn.Sequential(
            nn.Conv1d(n_input_channels, 64, kernel_size=7, padding=3),
            nn.BatchNorm1d(64),
            nn.ReLU(),
            nn.Conv1d(64, 128, kernel_size=5, padding=2),
            nn.BatchNorm1d(128),
            nn.ReLU(),
            nn.Conv1d(128, 256, kernel_size=3, padding=1),
            nn.BatchNorm1d(256),
            nn.ReLU(),
            nn.AdaptiveAvgPool1d(1),  # Global average pooling
        )

        # Task classifier (anomaly detection)
        self.task_classifier = nn.Sequential(
            nn.Linear(256, 128),
            nn.ReLU(),
            nn.Dropout(0.3),
            nn.Linear(128, n_classes),
        )

        # Domain discriminator
        self.domain_discriminator = nn.Sequential(
            GradientReversalLayer(lambda_val=1.0),
            nn.Linear(256, 128),
            nn.ReLU(),
            nn.Dropout(0.3),
            nn.Linear(128, n_domains),
        )

    def forward(self, x):
        features = self.feature_extractor(x).squeeze(-1)
        task_output = self.task_classifier(features)
        domain_output = self.domain_discriminator(features)
        return task_output, domain_output

    def set_lambda(self, lambda_val):
        """Update GRL lambda (schedule during training)."""
        for module in self.domain_discriminator.modules():
            if isinstance(module, GradientReversalLayer):
                module.lambda_val = lambda_val


def train_dann(model, source_loader, target_loader, n_epochs=50, device='cpu'):
    """Train DANN with progressive lambda scheduling."""
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
    task_criterion = nn.CrossEntropyLoss()
    domain_criterion = nn.CrossEntropyLoss()

    model.to(device)

    for epoch in range(n_epochs):
        model.train()

        # Progressive lambda: 0 -> 1 over training
        p = epoch / n_epochs
        lambda_val = 2.0 / (1.0 + torch.exp(torch.tensor(-10.0 * p))) - 1.0
        model.set_lambda(lambda_val.item())

        # Iterate over both loaders simultaneously
        target_iter = iter(target_loader)

        for source_x, source_y in source_loader:
            try:
                target_x, _ = next(target_iter)
            except StopIteration:
                target_iter = iter(target_loader)
                target_x, _ = next(target_iter)

            source_x = source_x.to(device)
            source_y = source_y.to(device)
            target_x = target_x.to(device)

            # Source domain: label = 0
            source_task_out, source_domain_out = model(source_x)
            source_domain_labels = torch.zeros(
                source_x.size(0), dtype=torch.long, device=device
            )

            # Target domain: label = 1 (no task labels!)
            _, target_domain_out = model(target_x)
            target_domain_labels = torch.ones(
                target_x.size(0), dtype=torch.long, device=device
            )

            # Combined loss
            task_loss = task_criterion(source_task_out, source_y)
            domain_loss = domain_criterion(source_domain_out, source_domain_labels) \
                        + domain_criterion(target_domain_out, target_domain_labels)

            total_loss = task_loss + domain_loss

            optimizer.zero_grad()
            total_loss.backward()
            optimizer.step()

        if (epoch + 1) % 10 == 0:
            print(f"Epoch {epoch+1}/{n_epochs} | "
                  f"Task Loss: {task_loss.item():.4f} | "
                  f"Domain Loss: {domain_loss.item():.4f} | "
                  f"Lambda: {lambda_val.item():.4f}")

Key Takeaway: The gradient reversal layer is the heart of DANN. It makes the feature extractor learn representations that simultaneously minimize the task classification loss and maximize the domain classification loss. The result: features that are useful for anomaly detection but brand-agnostic.

The Cobot Anomaly Detection Scenario

Now let’s apply everything we’ve discussed to a concrete, industrially relevant problem. You manage a factory with multiple collaborative robots from different manufacturers — Universal Robots UR5e, FANUC CRX-10iA, ABB GoFa, KUKA LBR iiwa, and Doosan M1013. All are 6-axis or 7-axis articulated arms performing similar tasks. All generate sensor data: joint torques, positions, velocities, and motor currents.

You want one anomaly detection system that works across all brands, or at least a system that can be quickly adapted to a new brand without collecting thousands of labeled anomaly examples.

The challenge: despite sharing the same kinematic structure, each brand has fundamentally different data distributions due to:

  • Sensor characteristics: Different torque sensor resolutions, noise floors, and sampling rates (125 Hz vs 500 Hz vs 1 kHz)
  • Control systems: Different PID gains, trajectory planning algorithms, and jerk limits
  • Calibration: Different zero-point offsets, gear ratio tolerances, and friction models
  • Firmware: Different interpolation methods, filtering strategies, and data encoding

Let’s examine six strategies for tackling this, ranging from simple preprocessing to sophisticated neural domain adaptation.

Strategy 1: Domain-Invariant Feature Learning with DANN

This is the most principled approach. Using the DANN architecture from the previous section, we train on labeled data from one brand (say, UR5e — the most common cobot with the most available data) and use unlabeled data from other brands during training. The gradient reversal layer forces the feature extractor to learn representations that capture anomaly-relevant patterns while being invariant to brand-specific sensor characteristics.

import torch
import torch.nn as nn
from torch.utils.data import Dataset, DataLoader
import numpy as np


class CobotSensorDataset(Dataset):
    """Dataset for multi-joint cobot sensor data.

    Each sample: (n_joints * n_features, seq_len) tensor
    Features per joint: torque, position, velocity, current
    """
    def __init__(self, data, labels, domain_id):
        self.data = torch.FloatTensor(data)       # (N, channels, seq_len)
        self.labels = torch.LongTensor(labels)     # (N,) - 0=normal, 1=anomaly
        self.domain_id = domain_id

    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx):
        return self.data[idx], self.labels[idx], self.domain_id


class CobotDANN(nn.Module):
    """DANN specifically designed for cobot anomaly detection.

    Input: multi-joint sensor data (6 joints x 4 features = 24 channels)
    Task: binary anomaly detection
    Domain: cobot brand identification (adversarial)
    """
    def __init__(self, n_joints=6, features_per_joint=4, n_brands=5):
        super().__init__()
        in_ch = n_joints * features_per_joint

        self.encoder = nn.Sequential(
            # Block 1: capture local temporal patterns
            nn.Conv1d(in_ch, 64, kernel_size=7, padding=3),
            nn.BatchNorm1d(64),
            nn.ReLU(),
            nn.MaxPool1d(2),

            # Block 2: capture mid-range dependencies
            nn.Conv1d(64, 128, kernel_size=5, padding=2),
            nn.BatchNorm1d(128),
            nn.ReLU(),
            nn.MaxPool1d(2),

            # Block 3: high-level features
            nn.Conv1d(128, 256, kernel_size=3, padding=1),
            nn.BatchNorm1d(256),
            nn.ReLU(),
            nn.AdaptiveAvgPool1d(1),
        )

        self.anomaly_head = nn.Sequential(
            nn.Linear(256, 128),
            nn.ReLU(),
            nn.Dropout(0.3),
            nn.Linear(128, 2),
        )

        self.domain_head = nn.Sequential(
            GradientReversalLayer(lambda_val=1.0),
            nn.Linear(256, 128),
            nn.ReLU(),
            nn.Dropout(0.3),
            nn.Linear(128, n_brands),
        )

    def forward(self, x):
        features = self.encoder(x).squeeze(-1)
        anomaly_pred = self.anomaly_head(features)
        domain_pred = self.domain_head(features)
        return anomaly_pred, domain_pred, features

    def predict_anomaly(self, x):
        """Inference: only anomaly prediction needed."""
        features = self.encoder(x).squeeze(-1)
        return self.anomaly_head(features)
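One training step for CobotDANN can be sketched as follows. This is a sketch, not the article's canonical loop: the `dann_train_step` helper, the 0.1 domain weight, and the integer domain IDs are illustrative choices. Note that the target batch contributes only the domain loss, so no anomaly labels are needed for the new brand.

```python
import torch
import torch.nn as nn


def dann_train_step(model, optimizer, src_x, src_y, src_domain, tgt_x,
                    tgt_domain, domain_weight=0.1):
    """One DANN update: supervised anomaly loss on the labeled source batch
    plus domain-confusion loss on both source and unlabeled target batches.
    Assumes model(x) returns (anomaly_logits, domain_logits, features),
    as CobotDANN above does."""
    model.train()
    optimizer.zero_grad()

    # Source batch: task loss + domain loss
    anomaly_pred, src_dom_pred, _ = model(src_x)
    task_loss = nn.functional.cross_entropy(anomaly_pred, src_y)

    # Target batch: domain loss only (labels not required)
    _, tgt_dom_pred, _ = model(tgt_x)

    src_dom_y = torch.full((src_x.size(0),), src_domain,
                           dtype=torch.long, device=src_x.device)
    tgt_dom_y = torch.full((tgt_x.size(0),), tgt_domain,
                           dtype=torch.long, device=tgt_x.device)
    domain_loss = (nn.functional.cross_entropy(src_dom_pred, src_dom_y)
                   + nn.functional.cross_entropy(tgt_dom_pred, tgt_dom_y))

    loss = task_loss + domain_weight * domain_loss
    loss.backward()
    optimizer.step()
    return task_loss.item(), domain_loss.item()
```

The gradient reversal layer inside the domain head flips the sign of the domain gradient before it reaches the encoder, so minimizing this combined loss simultaneously trains the domain classifier and confuses it.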

Strategy 2: Multi-Source Domain Adaptation

When you have data from multiple brands, you can leverage all of them simultaneously. The key insight is to use domain-specific batch normalization: each brand gets its own BN layer to handle its unique distribution statistics, while all other weights are shared. This captures the intuition that different brands have different means and variances in their sensor data, but the learned features (convolution filters) should be universal.

class DomainSpecificBatchNorm(nn.Module):
    """Maintain separate BN statistics per domain (brand)."""

    def __init__(self, n_features, n_domains):
        super().__init__()
        self.bn_layers = nn.ModuleList([
            nn.BatchNorm1d(n_features) for _ in range(n_domains)
        ])
        self.n_domains = n_domains

    def forward(self, x, domain_id):
        # nn.BatchNorm1d already switches between batch statistics (train)
        # and running statistics (eval) on its own; this layer only routes
        # the input to the right brand's BN instance.
        return self.bn_layers[domain_id](x)

    def add_domain(self):
        """Add BN layer for a new brand — initialize from average of existing."""
        device = self.bn_layers[0].running_mean.device
        new_bn = nn.BatchNorm1d(self.bn_layers[0].num_features).to(device)

        # Initialize with average statistics across existing domains
        with torch.no_grad():
            avg_mean = torch.stack(
                [bn.running_mean for bn in self.bn_layers]
            ).mean(0)
            avg_var = torch.stack(
                [bn.running_var for bn in self.bn_layers]
            ).mean(0)
            new_bn.running_mean.copy_(avg_mean)
            new_bn.running_var.copy_(avg_var)

        self.bn_layers.append(new_bn)
        self.n_domains += 1


class MultiSourceCobotModel(nn.Module):
    """Multi-source model with domain-specific batch normalization."""

    def __init__(self, n_joints=6, features_per_joint=4, n_brands=5):
        super().__init__()
        in_ch = n_joints * features_per_joint

        self.conv1 = nn.Conv1d(in_ch, 64, kernel_size=7, padding=3)
        self.bn1 = DomainSpecificBatchNorm(64, n_brands)

        self.conv2 = nn.Conv1d(64, 128, kernel_size=5, padding=2)
        self.bn2 = DomainSpecificBatchNorm(128, n_brands)

        self.conv3 = nn.Conv1d(128, 256, kernel_size=3, padding=1)
        self.bn3 = DomainSpecificBatchNorm(256, n_brands)

        self.pool = nn.AdaptiveAvgPool1d(1)
        self.classifier = nn.Sequential(
            nn.Linear(256, 128),
            nn.ReLU(),
            nn.Dropout(0.3),
            nn.Linear(128, 2),
        )

    def forward(self, x, domain_id=0):
        x = torch.relu(self.bn1(self.conv1(x), domain_id))
        x = torch.relu(self.bn2(self.conv2(x), domain_id))
        x = torch.relu(self.bn3(self.conv3(x), domain_id))
        x = self.pool(x).squeeze(-1)
        return self.classifier(x)
Tip: When a new brand arrives, call model.bn1.add_domain(), model.bn2.add_domain(), etc. Then run a few hundred unlabeled samples from the new brand through the model to calibrate the new BN statistics. No labeled data required for initial deployment.
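In code, that onboarding flow might look like this. It is a sketch against the MultiSourceCobotModel above; the batch count and the plain iterable of unlabeled tensors are assumptions.

```python
import torch


def onboard_new_brand(model, unlabeled_loader, n_batches=50):
    """Register a new brand and calibrate its BN statistics from
    unlabeled data. Assumes the model exposes bn1/bn2/bn3 as
    DomainSpecificBatchNorm layers, as MultiSourceCobotModel does."""
    for bn in (model.bn1, model.bn2, model.bn3):
        bn.add_domain()
    new_id = model.bn1.n_domains - 1

    model.train()  # BN only updates running stats in train mode
    with torch.no_grad():  # no parameter updates, statistics only
        for i, batch_x in enumerate(unlabeled_loader):
            if i >= n_batches:
                break
            model(batch_x, domain_id=new_id)
    model.eval()
    return new_id
```

Because `add_domain` seeds the new BN layer with the average of the existing brands' statistics, the calibration batches only need to nudge the running mean and variance toward the new brand, not learn them from scratch.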

Strategy 3: Fine-Tuning with Normalization Alignment

This is the pragmatist’s approach. Pre-train a full anomaly detection model on your best-labeled brand (e.g., UR5e with 50,000 labeled samples). When adapting to a new brand, freeze all convolutional and LSTM weights and only fine-tune the batch normalization layers and the final classifier head.

Why does this work? Because the kinematic structure is the same across brands. The convolutional filters that detect “sudden torque spike in joint 3” or “velocity reversal pattern” are fundamentally the same regardless of brand. What differs is the statistical distribution of the data — exactly what batch normalization captures.

def bn_only_fine_tune(pretrained_model, target_loader, n_epochs=10, lr=1e-3):
    """Fine-tune only BatchNorm layers + classifier for a new cobot brand.

    This is the fastest adaptation strategy: typically converges in
    5-10 epochs with as few as 100-500 labeled samples.
    """
    model = pretrained_model

    # Freeze everything
    for param in model.parameters():
        param.requires_grad = False

    # Unfreeze only BatchNorm parameters and classifier
    for module in model.modules():
        if isinstance(module, nn.BatchNorm1d):
            for param in module.parameters():
                param.requires_grad = True
            # Reset running statistics for the new domain
            module.reset_running_stats()

    for param in model.classifier.parameters():
        param.requires_grad = True

    # Collect trainable params
    trainable = [p for p in model.parameters() if p.requires_grad]
    optimizer = torch.optim.Adam(trainable, lr=lr)
    criterion = nn.CrossEntropyLoss()

    print(f"Trainable parameters: {sum(p.numel() for p in trainable):,}")
    print(f"Total parameters: {sum(p.numel() for p in model.parameters()):,}")

    for epoch in range(n_epochs):
        model.train()
        total_loss = 0
        correct = 0
        total = 0

        for batch in target_loader:
            # Works for both (x, y) loaders and the (x, y, domain_id)
            # triples produced by CobotSensorDataset
            batch_x, batch_y = batch[0], batch[1]
            optimizer.zero_grad()
            output = model(batch_x)
            loss = criterion(output, batch_y)
            loss.backward()
            optimizer.step()

            total_loss += loss.item()
            predicted = output.argmax(dim=1)
            correct += (predicted == batch_y).sum().item()
            total += batch_y.size(0)

        acc = 100.0 * correct / total
        avg_loss = total_loss / len(target_loader)
        print(f"Epoch {epoch+1}/{n_epochs} | Loss: {avg_loss:.4f} | Acc: {acc:.1f}%")

    return model

Strategy 4: Contrastive Domain Adaptation

Contrastive learning provides a powerful alternative to adversarial approaches. The core idea: learn an embedding space where “normal” operation from any brand maps to similar representations, and “anomalous” patterns remain distinguishable regardless of which brand produced them.

We use a Supervised Contrastive (SupCon) loss that pulls together embeddings of the same class (normal/anomaly) regardless of brand, while pushing apart embeddings of different classes:

class SupConDomainLoss(nn.Module):
    """Supervised contrastive loss that ignores domain (brand) labels.

    Positive pairs: same anomaly class, any brand
    Negative pairs: different anomaly class, any brand

    This forces brand-invariant but anomaly-discriminative embeddings.
    """
    def __init__(self, temperature=0.07):
        super().__init__()
        self.temperature = temperature

    def forward(self, features, labels):
        """
        Args:
            features: (batch_size, feature_dim) - L2-normalized embeddings
            labels: (batch_size,) - anomaly labels (0=normal, 1=anomaly)
        """
        device = features.device
        batch_size = features.shape[0]

        # Pairwise similarity matrix
        similarity = torch.matmul(features, features.T) / self.temperature

        # Mask: 1 where labels match (positive pairs), 0 otherwise
        labels = labels.unsqueeze(1)
        mask = torch.eq(labels, labels.T).float().to(device)

        # Remove self-similarity from mask
        self_mask = torch.eye(batch_size, device=device)
        mask = mask - self_mask

        # Numerical stability
        logits_max = similarity.max(dim=1, keepdim=True).values.detach()
        logits = similarity - logits_max

        # Denominator: all pairs except self
        exp_logits = torch.exp(logits) * (1 - self_mask)
        log_prob = logits - torch.log(exp_logits.sum(dim=1, keepdim=True) + 1e-8)

        # Average over positive pairs
        n_positives = mask.sum(dim=1)
        mean_log_prob = (mask * log_prob).sum(dim=1) / (n_positives + 1e-8)

        loss = -mean_log_prob[n_positives > 0].mean()
        return loss


class ContrastiveCobotModel(nn.Module):
    """Contrastive model for cross-brand cobot anomaly detection."""

    def __init__(self, n_input_channels=24, embed_dim=128):
        super().__init__()

        self.encoder = nn.Sequential(
            nn.Conv1d(n_input_channels, 64, kernel_size=7, padding=3),
            nn.BatchNorm1d(64),
            nn.ReLU(),
            nn.Conv1d(64, 128, kernel_size=5, padding=2),
            nn.BatchNorm1d(128),
            nn.ReLU(),
            nn.Conv1d(128, 256, kernel_size=3, padding=1),
            nn.BatchNorm1d(256),
            nn.ReLU(),
            nn.AdaptiveAvgPool1d(1),
        )

        # Projection head for contrastive learning
        self.projector = nn.Sequential(
            nn.Linear(256, 256),
            nn.ReLU(),
            nn.Linear(256, embed_dim),
        )

        # Classifier for anomaly detection
        self.classifier = nn.Linear(256, 2)

    def forward(self, x):
        features = self.encoder(x).squeeze(-1)
        projections = nn.functional.normalize(self.projector(features), dim=1)
        logits = self.classifier(features)
        return logits, projections
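Training combines the SupCon loss on the projections with ordinary cross-entropy on the classifier logits. One sketch of the combined step; the `contrastive_train_step` helper and the 0.5 weighting are illustrative choices to tune, not part of the original SupCon recipe.

```python
import torch
import torch.nn as nn


def contrastive_train_step(model, optimizer, supcon_loss, batch_x, batch_y,
                           supcon_weight=0.5):
    """Joint update: cross-entropy on the classifier logits plus the
    supervised contrastive loss on the L2-normalized projections.
    Brand labels are deliberately unused, so the embedding is pushed
    to be anomaly-discriminative but brand-agnostic."""
    model.train()
    optimizer.zero_grad()
    logits, projections = model(batch_x)  # as ContrastiveCobotModel returns
    ce = nn.functional.cross_entropy(logits, batch_y)
    con = supcon_loss(projections, batch_y)
    loss = ce + supcon_weight * con
    loss.backward()
    optimizer.step()
    return loss.item()
```

A common refinement is to train with the contrastive term alone first, then attach and train the classifier on the frozen encoder; the single-step version above is the simplest starting point.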

Strategy 5: Feature Normalization / Preprocessing Approach

Before reaching for neural domain adaptation, consider whether simple preprocessing can eliminate the distribution gap. This “boring” approach is often underrated and sometimes sufficient:

import numpy as np
from scipy.interpolate import interp1d


class CobotSignalNormalizer:
    """Normalize sensor signals to a common reference frame across brands.

    This preprocessing pipeline handles:
    1. Sampling rate alignment (resample to common rate)
    2. Per-joint Z-score normalization (per brand statistics)
    3. Torque residual computation (remove gravity/friction effects)
    4. Signal clipping for outlier robustness
    """

    def __init__(self, target_sample_rate=250, target_seq_len=200):
        self.target_sample_rate = target_sample_rate
        self.target_seq_len = target_seq_len
        self.brand_stats = {}  # {brand: {joint: {feature: (mean, std)}}}

    def fit_brand(self, brand_name, data):
        """Compute normalization statistics for a brand.

        Args:
            brand_name: str, e.g. 'ur5e'
            data: np.array of shape (n_samples, n_joints, n_features, seq_len)
        """
        n_samples, n_joints, n_features, seq_len = data.shape
        stats = {}
        for j in range(n_joints):
            stats[j] = {}
            for f in range(n_features):
                channel_data = data[:, j, f, :].flatten()
                stats[j][f] = (
                    float(np.mean(channel_data)),
                    float(np.std(channel_data)) + 1e-8
                )
        self.brand_stats[brand_name] = stats

    def normalize(self, data, brand_name, source_sample_rate):
        """Normalize a batch of sensor data from a specific brand.

        Args:
            data: np.array (n_samples, n_joints, n_features, seq_len)
            brand_name: str
            source_sample_rate: int, Hz

        Returns:
            Normalized data: np.array (n_samples, n_joints*n_features, target_seq_len)
        """
        n_samples, n_joints, n_features, seq_len = data.shape

        # Step 1: Resample to common rate
        if (source_sample_rate != self.target_sample_rate
                or seq_len != self.target_seq_len):
            source_times = np.linspace(0, 1, seq_len)
            target_times = np.linspace(0, 1, self.target_seq_len)
            resampled = np.zeros(
                (n_samples, n_joints, n_features, self.target_seq_len)
            )
            for i in range(n_samples):
                for j in range(n_joints):
                    for f in range(n_features):
                        interpolator = interp1d(
                            source_times, data[i, j, f, :], kind='cubic'
                        )
                        resampled[i, j, f, :] = interpolator(target_times)
            data = resampled

        # Step 2: Z-score normalization per joint per feature
        stats = self.brand_stats[brand_name]
        normalized = np.zeros_like(data)
        for j in range(n_joints):
            for f in range(n_features):
                mean, std = stats[j][f]
                normalized[:, j, f, :] = (data[:, j, f, :] - mean) / std

        # Step 3: Clip to ±5 sigma for robustness
        normalized = np.clip(normalized, -5, 5)

        # Step 4: Reshape to (n_samples, channels, seq_len)
        n_samples = normalized.shape[0]
        seq_len = normalized.shape[-1]
        output = normalized.reshape(n_samples, n_joints * n_features, seq_len)

        return output
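The pipeline is easiest to see on a single channel. The sketch below walks one illustrative 400-sample, 500 Hz window through steps 1, 2, and 4 of the class above; the signal, rates, and single-window statistics are made up for demonstration (in practice `fit_brand` estimates statistics over a whole dataset).

```python
import numpy as np
from scipy.interpolate import interp1d

# Step 1: map a 400-sample window (500 Hz) onto the 200-sample common grid
rng = np.random.default_rng(0)
signal_500hz = (np.sin(np.linspace(0, 4 * np.pi, 400))
                + 0.1 * rng.standard_normal(400))

source_times = np.linspace(0, 1, 400)  # normalized time axis
target_times = np.linspace(0, 1, 200)  # common grid (250 Hz equivalent)
resampled = interp1d(source_times, signal_500hz, kind='cubic')(target_times)

# Step 2: z-score with brand statistics (here taken from this one window)
brand_mean = signal_500hz.mean()
brand_std = signal_500hz.std() + 1e-8

# Step 4: clip to +/-5 sigma for outlier robustness
normalized = np.clip((resampled - brand_mean) / brand_std, -5, 5)

print(normalized.shape)  # (200,)
```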

Strategy 6: Foundation Model Approach

The most forward-looking approach leverages the emerging ecosystem of time-series foundation models. The idea is to pre-train a large model on data from all available cobot brands in a self-supervised manner (e.g., masked time-series modeling), then fine-tune for anomaly detection with minimal labeled data from each brand.

This approach makes the most sense when you have access to massive amounts of unlabeled sensor data across many brands — which is increasingly common as cobot fleets grow. Models like Chronos (Amazon), TimesFM (Google), and Lag-Llama have shown that transformer-based architectures can learn transferable representations across diverse time-series domains.

class CobotFoundationModel(nn.Module):
    """Simplified foundation model for cobot sensor time-series.

    Pre-training task: masked sensor reconstruction
    Fine-tuning task: anomaly detection
    """
    def __init__(self, n_channels=24, d_model=256, n_heads=8,
                 n_layers=6, seq_len=200, mask_ratio=0.15):
        super().__init__()
        self.mask_ratio = mask_ratio

        # Patch embedding (treat each timestep as a "token")
        self.input_proj = nn.Linear(n_channels, d_model)
        self.pos_embedding = nn.Parameter(
            torch.randn(1, seq_len, d_model) * 0.02
        )

        # Transformer encoder
        encoder_layer = nn.TransformerEncoderLayer(
            d_model=d_model,
            nhead=n_heads,
            dim_feedforward=d_model * 4,
            dropout=0.1,
            batch_first=True,
        )
        self.transformer = nn.TransformerEncoder(
            encoder_layer, num_layers=n_layers
        )

        # Pre-training head: reconstruct masked timesteps
        self.reconstruction_head = nn.Linear(d_model, n_channels)

        # Fine-tuning head: anomaly classification
        self.anomaly_head = nn.Sequential(
            nn.Linear(d_model, 128),
            nn.ReLU(),
            nn.Dropout(0.1),
            nn.Linear(128, 2),
        )

    def forward_pretrain(self, x):
        """Pre-training: masked reconstruction.

        x: (batch, n_channels, seq_len)
        """
        x = x.transpose(1, 2)  # (batch, seq_len, n_channels)
        batch_size, seq_len, _ = x.shape

        # Create random mask
        mask = torch.rand(batch_size, seq_len, device=x.device) < self.mask_ratio
        masked_x = x.clone()
        masked_x[mask] = 0.0

        # Encode
        h = self.input_proj(masked_x) + self.pos_embedding[:, :seq_len, :]
        h = self.transformer(h)

        # Reconstruct
        reconstruction = self.reconstruction_head(h)

        # Loss only on masked positions
        loss = nn.functional.mse_loss(
            reconstruction[mask], x[mask]
        )
        return loss

    def forward_anomaly(self, x):
        """Fine-tuning / inference: anomaly detection.

        x: (batch, n_channels, seq_len)
        """
        x = x.transpose(1, 2)
        h = self.input_proj(x) + self.pos_embedding[:, :x.size(1), :]
        h = self.transformer(h)

        # Global average pooling across time
        h_pooled = h.mean(dim=1)
        return self.anomaly_head(h_pooled)
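The two-phase recipe for this model can be sketched as below. The `pretrain_then_finetune` helper is hypothetical: epoch counts and the AdamW learning rate are placeholders, and a real pre-training run would use far more data and compute.

```python
import torch


def pretrain_then_finetune(model, unlabeled_loader, labeled_loader,
                           pretrain_epochs=2, finetune_epochs=2, lr=1e-4):
    """Two-phase training: self-supervised masked reconstruction on
    unlabeled multi-brand data, then supervised anomaly fine-tuning
    with minimal labels. Assumes the model exposes forward_pretrain
    and forward_anomaly, as CobotFoundationModel above does."""
    opt = torch.optim.AdamW(model.parameters(), lr=lr)

    # Phase 1: masked reconstruction on unlabeled data from all brands
    for _ in range(pretrain_epochs):
        for batch_x in unlabeled_loader:
            opt.zero_grad()
            loss = model.forward_pretrain(batch_x)
            loss.backward()
            opt.step()

    # Phase 2: anomaly classification with a few labels per brand
    for _ in range(finetune_epochs):
        for batch_x, batch_y in labeled_loader:
            opt.zero_grad()
            logits = model.forward_anomaly(batch_x)
            loss = torch.nn.functional.cross_entropy(logits, batch_y)
            loss.backward()
            opt.step()
    return model
```

In phase 2 it is common to use a lower learning rate (or freeze early transformer layers) so the fine-tuning does not erase the pre-trained representation; the single optimizer above is the simplest variant.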

Strategy Comparison and Recommendation

| Strategy | Labeled Data Needed | Complexity | Adaptation Speed | Expected Performance |
| --- | --- | --- | --- | --- |
| 1. DANN | Source only | Medium-High | Slow (retrain) | High |
| 2. Multi-Source BN | Multiple sources | Medium | Fast (BN calibration only) | High |
| 3. BN Fine-Tuning | 100-500 target samples | Low | Very fast (minutes) | Good |
| 4. Contrastive | Source + some target | Medium-High | Moderate | High |
| 5. Normalization | None (unsupervised stats) | Very Low | Instant | Moderate |
| 6. Foundation Model | Minimal per brand | Very High | Fast (once pre-trained) | Highest (with scale) |

Key Takeaway — Recommended Pipeline: Start with Strategy 5 (normalization) + Strategy 3 (BN fine-tuning) as your baseline. This combination is fast to implement, requires minimal labeled data, and handles the most common sources of cross-brand distribution shift. If performance is insufficient, escalate to Strategy 1 (DANN) or Strategy 2 (Multi-Source BN). Reserve Strategy 6 (Foundation Model) for organizations with large-scale multi-brand data and the compute budget to match.

Practical Implementation Guide

Data Collection for Cobots

The quality of your domain adaptation depends entirely on the quality of your data. For multi-brand cobot anomaly detection, consider the following:

Sensor selection: At minimum, collect per-joint torque, position, velocity, and motor current. These four signals per joint provide a comprehensive view of the robot's mechanical state. For a 6-axis cobot, that's 24 sensor channels.

Sampling rate: Different brands sample at different rates (UR5e at 500 Hz, FANUC at 250 Hz, KUKA at 1 kHz). Either resample to a common rate or use architectures that handle variable-length inputs.

Labeling strategy: Labeling anomalies requires domain expertise. A practical approach is to label by operational segment (one pick-and-place cycle) rather than by individual timestep. Use a three-tier scheme: normal, anomalous, and uncertain. Only train on the first two.

Data volume guidelines: For the source brand, aim for at least 10,000 labeled segments (with at least 500 anomalies). For target brands, even 100-500 labeled segments enable effective fine-tuning if you use Strategy 3 or 5.
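The segment-level labeling described above implies cutting continuous sensor streams into fixed-length windows before labeling. A minimal sketch of that step; in practice you would cut on cycle boundaries reported by the robot program rather than a fixed stride.

```python
import numpy as np


def segment_stream(stream, seq_len=200, stride=100):
    """Cut a continuous multi-channel sensor stream into overlapping
    fixed-length segments suitable for segment-level labeling.

    stream: (n_channels, total_len) -> (n_segments, n_channels, seq_len)
    """
    n_channels, total_len = stream.shape
    starts = range(0, total_len - seq_len + 1, stride)
    return np.stack([stream[:, s:s + seq_len] for s in starts])
```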

Feature Engineering for Multi-Joint Cobots

Raw sensor signals can be enhanced with engineered features that capture domain-relevant physics:

  • Joint torque residuals: The difference between measured torque and expected torque from the robot's dynamic model. This removes the "normal" torque component (gravity, inertia, friction) and isolates anomalous forces.
  • Energy consumption profiles: Power = torque × velocity per joint. Anomalies often manifest as unexpected energy consumption patterns before they appear in raw signals.
  • Vibration spectra: FFT of accelerometer or high-frequency torque data. Bearing degradation, gear wear, and loose bolts each have distinctive frequency signatures.
  • Kinematic error metrics: Difference between commanded and actual trajectory. Increasing tracking error often precedes mechanical failure.
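Three of these features are cheap enough to sketch directly from per-joint signals. The torque residual is omitted here because it requires the brand's dynamic model; the function name and signal layout are illustrative.

```python
import numpy as np


def engineered_features(torque, velocity, commanded_pos, actual_pos):
    """Derive engineered features from per-joint signals, each of
    shape (n_joints, seq_len)."""
    power = torque * velocity                       # instantaneous power per joint
    tracking_error = commanded_pos - actual_pos     # kinematic error metric
    spectrum = np.abs(np.fft.rfft(torque, axis=1))  # magnitude spectrum per joint
    return power, tracking_error, spectrum
```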

Model Architecture Choices

| Architecture | Strengths | Weaknesses | Best For |
| --- | --- | --- | --- |
| 1D-CNN | Fast, local pattern detection | Limited long-range dependencies | Short anomaly patterns, real-time edge |
| LSTM/GRU | Sequential memory, temporal context | Slow training, vanishing gradients | Long-term degradation patterns |
| LSTM-AutoEncoder | Unsupervised, reconstruction-based | Threshold tuning, slower inference | Minimal labels, novelty detection |
| Transformer | Global attention, parallelizable | Data-hungry, quadratic complexity | Large datasets, complex multi-joint patterns |
| CNN-LSTM Hybrid | Best of both: local + temporal | More hyperparameters | General-purpose (recommended) |

For the cobot scenario, the CNN-LSTM hybrid is typically the best starting point. Here's a complete implementation with domain adaptation support:

class CobotCNNLSTMAutoEncoder(nn.Module):
    """CNN-LSTM AutoEncoder with domain adaptation for cobot anomaly detection.

    Architecture:
    - CNN encoder: extracts local temporal features
    - LSTM: captures sequential dependencies
    - CNN decoder: reconstructs input signal
    - Domain discriminator (optional): for DANN-style adaptation

    Anomaly score: reconstruction error (MSE)
    """
    def __init__(self, n_channels=24, hidden_dim=128, lstm_layers=2,
                 n_domains=None):
        super().__init__()

        # --- Encoder ---
        self.conv_encoder = nn.Sequential(
            nn.Conv1d(n_channels, 64, kernel_size=7, padding=3),
            nn.BatchNorm1d(64),
            nn.ReLU(),
            nn.MaxPool1d(2),
            nn.Conv1d(64, 128, kernel_size=5, padding=2),
            nn.BatchNorm1d(128),
            nn.ReLU(),
            nn.MaxPool1d(2),
        )

        self.lstm_encoder = nn.LSTM(
            input_size=128,
            hidden_size=hidden_dim,
            num_layers=lstm_layers,
            batch_first=True,
            bidirectional=True,
            dropout=0.2,
        )

        # Bottleneck
        self.bottleneck = nn.Linear(hidden_dim * 2, hidden_dim)

        # --- Decoder ---
        self.lstm_decoder = nn.LSTM(
            input_size=hidden_dim,
            hidden_size=hidden_dim,
            num_layers=lstm_layers,
            batch_first=True,
            dropout=0.2,
        )

        self.conv_decoder = nn.Sequential(
            nn.Upsample(scale_factor=2),
            nn.Conv1d(hidden_dim, 128, kernel_size=5, padding=2),
            nn.BatchNorm1d(128),
            nn.ReLU(),
            nn.Upsample(scale_factor=2),
            nn.Conv1d(128, 64, kernel_size=7, padding=3),
            nn.BatchNorm1d(64),
            nn.ReLU(),
            nn.Conv1d(64, n_channels, kernel_size=3, padding=1),
        )

        # Optional domain discriminator
        self.domain_discriminator = None
        if n_domains is not None:
            self.domain_discriminator = nn.Sequential(
                GradientReversalLayer(lambda_val=1.0),
                nn.Linear(hidden_dim, 64),
                nn.ReLU(),
                nn.Linear(64, n_domains),
            )

    def encode(self, x):
        """Encode input to latent representation.

        x: (batch, n_channels, seq_len)
        """
        # CNN encoding
        conv_out = self.conv_encoder(x)  # (batch, 128, seq_len//4)

        # LSTM encoding
        conv_out = conv_out.transpose(1, 2)  # (batch, seq_len//4, 128)
        lstm_out, _ = self.lstm_encoder(conv_out)  # (batch, seq_len//4, 256)

        # Take last timestep as global representation
        global_repr = lstm_out[:, -1, :]  # (batch, 256)
        latent = self.bottleneck(global_repr)  # (batch, hidden_dim)

        return latent, conv_out.shape[1]  # return seq_len for decoder

    def decode(self, latent, target_seq_len):
        """Decode latent representation back to signal.

        latent: (batch, hidden_dim)
        """
        # Repeat latent for each timestep
        repeated = latent.unsqueeze(1).repeat(1, target_seq_len, 1)

        # LSTM decoding
        lstm_out, _ = self.lstm_decoder(repeated)  # (batch, seq_len, hidden_dim)

        # CNN decoding
        lstm_out = lstm_out.transpose(1, 2)  # (batch, hidden_dim, seq_len)
        reconstruction = self.conv_decoder(lstm_out)

        return reconstruction

    def forward(self, x):
        latent, seq_len = self.encode(x)
        reconstruction = self.decode(latent, seq_len)

        # Ensure reconstruction matches input size
        if reconstruction.size(2) != x.size(2):
            reconstruction = nn.functional.interpolate(
                reconstruction, size=x.size(2), mode='linear',
                align_corners=False
            )

        domain_pred = None
        if self.domain_discriminator is not None:
            domain_pred = self.domain_discriminator(latent)

        return reconstruction, domain_pred, latent

    def anomaly_score(self, x):
        """Compute per-sample anomaly score (reconstruction error)."""
        reconstruction, _, _ = self.forward(x)
        # MSE per sample
        mse = ((x - reconstruction) ** 2).mean(dim=(1, 2))
        return mse


def train_cobot_autoencoder(model, source_loader, target_loader=None,
                            n_epochs=100, device='cpu'):
    """Train the CNN-LSTM AutoEncoder with optional domain adaptation."""
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-3, weight_decay=1e-5)
    scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, n_epochs)

    model.to(device)

    for epoch in range(n_epochs):
        model.train()
        total_recon_loss = 0
        total_domain_loss = 0

        target_iter = iter(target_loader) if target_loader else None

        for batch_x, _, _ in source_loader:
            batch_x = batch_x.to(device)

            reconstruction, domain_pred, _ = model(batch_x)

            recon_loss = nn.functional.mse_loss(reconstruction, batch_x)
            total_loss = recon_loss

            # Domain adaptation loss (if target data available)
            if target_iter is not None and domain_pred is not None:
                try:
                    target_x, _, _ = next(target_iter)
                except StopIteration:
                    target_iter = iter(target_loader)
                    target_x, _, _ = next(target_iter)

                target_x = target_x.to(device)
                _, target_domain_pred, _ = model(target_x)

                source_domain_labels = torch.zeros(
                    batch_x.size(0), dtype=torch.long, device=device
                )
                target_domain_labels = torch.ones(
                    target_x.size(0), dtype=torch.long, device=device
                )

                domain_loss = (
                    nn.functional.cross_entropy(domain_pred, source_domain_labels)
                    + nn.functional.cross_entropy(target_domain_pred, target_domain_labels)
                )
                # Out-of-place add: `total_loss` aliases `recon_loss`, so an
                # in-place += would silently fold the domain term into the
                # logged reconstruction loss
                total_loss = total_loss + 0.1 * domain_loss
                total_domain_loss += domain_loss.item()

            optimizer.zero_grad()
            total_loss.backward()
            torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
            optimizer.step()

            total_recon_loss += recon_loss.item()

        scheduler.step()

        if (epoch + 1) % 10 == 0:
            avg_recon = total_recon_loss / len(source_loader)
            msg = f"Epoch {epoch+1}/{n_epochs} | Recon: {avg_recon:.6f}"
            if target_loader:
                avg_domain = total_domain_loss / len(source_loader)
                msg += f" | Domain: {avg_domain:.4f}"
            print(msg)

    return model
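Since the AutoEncoder's anomaly score is a reconstruction error, deployment also needs a decision threshold. One simple sketch: pick a percentile of scores on held-out data known to be normal; the 99th percentile is an assumed default that trades roughly 1% FPR for sensitivity, and the loader is assumed to yield the (x, y, domain) triples used above.

```python
import torch


def fit_threshold(model, normal_loader, percentile=99.0):
    """Choose an anomaly threshold from reconstruction errors on
    held-out NORMAL validation data. Assumes the model exposes
    anomaly_score(x), as CobotCNNLSTMAutoEncoder above does."""
    model.eval()
    scores = []
    with torch.no_grad():
        for batch_x, _, _ in normal_loader:
            scores.append(model.anomaly_score(batch_x))
    scores = torch.cat(scores)
    return torch.quantile(scores, percentile / 100.0).item()
```

At inference, any segment with `anomaly_score(x) > threshold` is flagged. The percentile should be tuned per deployment against the FPR target discussed in the next section.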

Evaluation Metrics

For production cobot anomaly detection, standard accuracy is meaningless — the class imbalance (often 99% normal, 1% anomaly) makes it trivial to achieve high accuracy by always predicting "normal". Use these metrics instead:

  • AUROC (Area Under ROC Curve): The primary metric. Measures the model's ability to rank anomalous samples higher than normal samples regardless of threshold. Aim for > 0.95.
  • F1 Score: The harmonic mean of precision and recall at the optimal threshold. Aim for > 0.85.
  • Precision@k: If you flag the top-k most anomalous samples, what fraction are true anomalies? Critical for maintenance teams who can only investigate a limited number of alerts per shift.
  • False Positive Rate (FPR): Perhaps the most critical metric in production. Each false positive triggers an unnecessary investigation, reducing trust in the system. Target FPR < 1% at your operating threshold.
Caution: When evaluating domain adaptation, always measure performance on the target domain separately. A model with 0.98 AUROC averaged across all brands might still have 0.85 AUROC on the newest brand — and that is the one you actually need to work.
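AUROC and F1 are available off the shelf (for example scikit-learn's `roc_auc_score` and `f1_score`); the two fleet-specific metrics are short enough to sketch directly in NumPy:

```python
import numpy as np


def precision_at_k(scores, labels, k):
    """Fraction of true anomalies among the k highest-scoring samples."""
    top_k = np.argsort(scores)[::-1][:k]
    return labels[top_k].mean()


def fpr_at_threshold(scores, labels, threshold):
    """False positive rate: fraction of normal samples flagged anomalous."""
    normal = labels == 0
    return (scores[normal] >= threshold).mean()
```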

Deployment Considerations

Edge vs. Cloud: Cobot anomaly detection often needs to run at the edge — directly on the robot controller or a nearby industrial PC. This constrains model size and inference latency. A CNN-based model with ~500K parameters can run inference in under 5ms on an NVIDIA Jetson. The full CNN-LSTM AutoEncoder (~2M parameters) needs about 20ms. Transformer models may require cloud deployment.

Inference latency requirements: For real-time safety-critical detection (e.g., collision avoidance), you need sub-10ms inference. For predictive maintenance (detecting degradation patterns), latency of 100ms–1s is acceptable since you're analyzing trends over minutes or hours.

Model update strategy: Domain drift happens — sensors degrade, firmware updates change data characteristics, and new operating conditions emerge. Plan for periodic re-calibration of BN statistics (weekly) and full fine-tuning (monthly) to maintain performance. Use monitoring to trigger updates: if anomaly score distributions shift significantly on data you know is normal, the model needs recalibration.
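The drift trigger described above can be made concrete with a two-sample test on anomaly scores. This is a sketch: the Kolmogorov-Smirnov test and the 0.01 cutoff are one reasonable choice, not the only one.

```python
import numpy as np
from scipy.stats import ks_2samp


def needs_recalibration(baseline_scores, recent_scores, p_threshold=0.01):
    """Flag drift when recent anomaly scores on known-normal data are no
    longer distributed like the deployment-time baseline, using a
    two-sample Kolmogorov-Smirnov test."""
    stat, p_value = ks_2samp(baseline_scores, recent_scores)
    return p_value < p_threshold
```

Store a baseline score sample at deployment time, re-run this check on each week's known-normal data, and schedule BN recalibration (or a full fine-tune) whenever it fires repeatedly.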

Conclusion

Transfer learning is not a single technique — it is a paradigm that encompasses fine-tuning, domain adaptation, feature extraction, and more. Understanding this hierarchy is the first step toward applying it effectively. Fine-tuning adapts a pre-trained model to new data through continued training. Domain adaptation bridges distribution gaps between source and target domains, even without target labels.

For heterogeneous cobot fleets, these techniques are not academic luxuries — they are operational necessities. The alternative is training separate models for every brand, every firmware version, and every operational context. That path leads to an unmaintainable jungle of models, each demanding its own labeled dataset.

The practical pipeline we recommend starts simple: normalize your sensor data across brands (Strategy 5) and fine-tune only the batch normalization layers (Strategy 3). This baseline requires minimal labeled data and can be deployed in hours. If performance falls short — particularly on brands with unusual sensor characteristics — escalate to adversarial domain adaptation (Strategy 1 with DANN) or contrastive methods (Strategy 4). For organizations building long-term cobot intelligence platforms, investing in a foundation model (Strategy 6) will yield compounding returns as the fleet grows.

The code examples throughout this post are complete and runnable. They are not production-ready — you'll need to add proper data loading, logging, checkpointing, and monitoring — but they provide the architectural foundation for any of the six strategies we discussed. The hardest part of cross-brand cobot anomaly detection is not the algorithm; it is collecting representative data and establishing a labeling protocol that domain experts can follow consistently.

As collaborative robots become as common as industrial PCs on the factory floor, the ability to transfer anomaly detection intelligence across brands will separate the organizations that scale their automation from those that drown in model maintenance. Transfer learning, fine-tuning, and domain adaptation are the tools that make that scaling possible.

References

  1. Pan, S. J., & Yang, Q. (2010). A Survey on Transfer Learning. IEEE Transactions on Knowledge and Data Engineering, 22(10), 1345-1359.
  2. Ganin, Y., et al. (2016). Domain-Adversarial Training of Neural Networks. Journal of Machine Learning Research, 17(1), 2096-2130.
  3. Sun, B., & Saenko, K. (2016). Deep CORAL: Correlation Alignment for Deep Domain Adaptation. ECCV Workshops.
  4. Howard, J., & Ruder, S. (2018). Universal Language Model Fine-tuning for Text Classification. ACL 2018.
  5. Hu, E. J., et al. (2022). LoRA: Low-Rank Adaptation of Large Language Models. ICLR 2022.
  6. Ansari, A. F., et al. (2024). Chronos: Learning the Language of Time Series. arXiv preprint arXiv:2403.07815.
  7. Long, M., et al. (2015). Learning Transferable Features with Deep Adaptation Networks. ICML 2015.
  8. Tzeng, E., et al. (2017). Adversarial Discriminative Domain Adaptation. CVPR 2017.
  9. Khosla, P., et al. (2020). Supervised Contrastive Learning. NeurIPS 2020.
  10. Li, Y., et al. (2017). Revisiting Batch Normalization For Practical Domain Adaptation. ICLR Workshop 2017.
  11. Zhao, H., et al. (2018). Adversarial Multiple Source Domain Adaptation. NeurIPS 2018.
  12. Courty, N., et al. (2017). Optimal Transport for Domain Adaptation. IEEE TPAMI, 39(9), 1853-1865.
  13. Das, A., et al. (2024). A Decoder-Only Foundation Model for Time-Series Forecasting. arXiv preprint arXiv:2310.10688 (TimesFM).
  14. ISO/TS 15066:2016. Robots and robotic devices — Collaborative robots. International Organization for Standardization.

Disclaimer: This article is for informational and educational purposes only. Any code examples are provided as-is and should be thoroughly tested and validated before use in production environments, especially in safety-critical robotics applications. Always follow your organization's safety protocols and applicable ISO standards when deploying anomaly detection systems on collaborative robots.
