RA-L 2026

DockAnywhere: Data-Efficient Visuomotor
Policy Learning for Mobile Manipulation
via Novel Demonstration Generation

Ziyu Shan¹, Yuheng Zhou¹, Gaoyuan Wu¹, Ziheng Ji¹, Zhenyu Wu², Ziwei Wang¹

¹ School of Electrical and Electronic Engineering, Nanyang Technological University, Singapore
² Beijing University of Posts and Telecommunications, Beijing, China


Augmentation Latency

0.1s

Per demonstration, enabling practical large-scale augmentation

ManiSkill Avg. Success

78.9%

Across five tasks with only 5 source demonstrations

Data Augmentation for

Mobile Manipulation

Point cloud editing · Spatial transform · Skill reuse

Abstract

Generalizing manipulation skills
across arbitrary docking positions

Mobile manipulation is a fundamental capability that enables robots to interact in expansive environments such as homes and factories. Most existing approaches follow a two-stage paradigm, where the robot first navigates to a docking point and then performs fixed-base manipulation using powerful visuomotor policies. However, real-world mobile manipulation often suffers from a view generalization problem caused by shifts in the docking point. To address this issue, we propose DockAnywhere, a low-cost demonstration generation framework that improves viewpoint generalization under docking variability by lifting a single demonstration to diverse feasible docking configurations. Specifically, DockAnywhere decouples docking-dependent base motions from contact-rich manipulation skills that remain invariant across viewpoints. Docking proposals are sampled under feasibility constraints, and the corresponding trajectories are generated via structure-preserving augmentation. Visual observations are synthesized in 3D space by representing the robot and objects as point clouds and applying point-level spatial editing, which keeps observations and actions consistent across viewpoints. Extensive experiments on ManiSkill and real-world platforms demonstrate that DockAnywhere substantially improves policy success rates and generalizes to novel viewpoints from docking points unseen during training, enhancing the generalization capability of mobile manipulation policies in real-world deployment.

Geometry-first augmentation

Trajectory reuse via rigid 3D spatial transforms—no policy rollouts or neural rendering at augmentation time.

Point-cloud observation synthesis

Scene point clouds are edited geometrically to match the target docking viewpoint, preserving full 3D structure.

Practical scale

0.1-second per-demo latency allows thousands of augmented trajectories from a handful of human demonstrations.
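As a rough illustration of what this latency implies, the sketch below estimates the time needed for a batch of augmentations. It assumes the 0.1 s figure applies to each generated demonstration and that generation runs sequentially; the workload (5 source demonstrations, 1000 docking targets each) and the function name are hypothetical.

```python
LATENCY_S = 0.1  # per-demonstration augmentation latency reported on this page


def augmentation_time_s(n_source, n_targets_per_source, latency_s=LATENCY_S):
    """Estimate how many demos are generated and how long it takes.

    Assumes sequential generation with a constant per-demo latency.
    """
    n_generated = n_source * n_targets_per_source
    return n_generated, n_generated * latency_s


# Hypothetical workload: 5 source demos, 1000 docking targets each.
n_generated, seconds = augmentation_time_s(5, 1000)
print(f"{n_generated} augmented demos in about {seconds / 60:.1f} min")
```

Under these assumptions, 5000 augmented trajectories take roughly 500 seconds of augmentation time.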

Method

Four-stage pipeline

DockAnywhere transforms a source trajectory into target-docking-aware demonstrations through four sequential stages.

DockAnywhere pipeline: source trajectory, docking point proposals, spatial enhancement, trajectory generation

Pipeline overview. (1) The source trajectory and scene are segmented into manipulation range and motion range. (2) A VLM scores candidate docking points for feasibility against the target object. (3) Spatial transforms (Δx, Δy, Δθ) align the trajectory to the new docking pose. (4) Motion replanning and skill reuse generate the final augmented trajectory with synthesized point-cloud observations.

Stage 01

Trajectory Parsing

Each source demonstration is decomposed into a motion segment—the mobile base navigating to the docking point—and a skill segment—the arm executing the manipulation. This split enables independent transformation of the two segments under different spatial constraints.
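One simple way to realize such a split is sketched below. It is a minimal illustration, not the parsing rule used by DockAnywhere: it assumes the base pose is logged per timestep as (x, y, theta) and treats the last timestep at which the base moves as the docking moment. The function name `split_index` and the threshold `eps` are hypothetical.

```python
import numpy as np


def split_index(base_poses, eps=1e-3):
    """Return the first timestep of the skill segment.

    base_poses: (T, 3) array of base poses (x, y, theta), one per timestep.
    The base is treated as docked after the last step in which it moved
    by more than eps (assumed heuristic, not taken from the paper).
    """
    base_poses = np.asarray(base_poses, dtype=float)
    delta = np.diff(base_poses, axis=0)
    # Wrap heading differences to [-pi, pi) so a 2*pi jump is not motion.
    delta[:, 2] = (delta[:, 2] + np.pi) % (2 * np.pi) - np.pi
    step = np.abs(delta).max(axis=1)
    moving = np.flatnonzero(step > eps)
    return 0 if moving.size == 0 else int(moving[-1]) + 1


# Toy demonstration: the base drives forward for three steps, then stays put.
poses = np.array([
    [0.0, 0.0, 0.0],
    [0.1, 0.0, 0.0],
    [0.2, 0.0, 0.0],
    [0.3, 0.0, 0.0],
    [0.3, 0.0, 0.0],
    [0.3, 0.0, 0.0],
])
k = split_index(poses)
motion_segment, skill_segment = poses[:k], poses[k:]
```

With this split, `motion_segment` can be replanned for a new docking point while `skill_segment` is reused after a change of reference frame.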

Stage 02

TAMP-Based Docking Proposals

Candidate docking points are sampled along a feasible arc around the target object. A vision-language model (VLM) evaluates each candidate for collision freedom and reachability, filtering to a set of valid target docking configurations without any physical rollouts.
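The sampling step can be sketched as follows. This is an illustrative stand-in only: the VLM evaluation is replaced by a plain geometric check (target within reach, obstacles outside a clearance radius), and every name and parameter here (`sample_docking_arc`, `is_feasible`, `max_reach`, `clearance`) is hypothetical rather than taken from the paper.

```python
import numpy as np


def sample_docking_arc(target_xy, radius, angle_range, n):
    """Sample n base poses (x, y, theta) on an arc around the target.

    Each pose is oriented so that the base faces the target object.
    """
    angles = np.linspace(angle_range[0], angle_range[1], n)
    x = target_xy[0] + radius * np.cos(angles)
    y = target_xy[1] + radius * np.sin(angles)
    theta = np.arctan2(target_xy[1] - y, target_xy[0] - x)
    return np.stack([x, y, theta], axis=1)


def is_feasible(pose, target_xy, obstacles_xy, max_reach, clearance):
    """Geometric placeholder for the VLM check (assumption, not the paper's)."""
    xy = np.asarray(pose[:2], dtype=float)
    reachable = np.linalg.norm(np.asarray(target_xy) - xy) <= max_reach
    obstacles = np.asarray(obstacles_xy, dtype=float).reshape(-1, 2)
    if obstacles.shape[0] == 0:
        return bool(reachable)
    dists = np.linalg.norm(obstacles - xy, axis=1)
    return bool(reachable and dists.min() >= clearance)


# Toy scene: target at the origin, one obstacle next to the middle candidate.
target = (0.0, 0.0)
obstacles = [(-0.8, 0.05)]
candidates = sample_docking_arc(target, 0.8, (np.pi / 2, 3 * np.pi / 2), 5)
valid = [p for p in candidates
         if is_feasible(p, target, obstacles, max_reach=1.0, clearance=0.3)]
```

In this toy scene the candidate directly beside the obstacle is rejected and the other four are kept as target docking configurations.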

Stage 03 & 04

Spatial Transform & Observation Synthesis

A rigid transform (Δx, Δy, Δθ) maps the source skill trajectory to the target docking frame. The motion segment is replanned via standard navigation. Scene point clouds are edited geometrically—translated and rotated—to match the new egocentric viewpoint, producing a complete augmented demonstration with consistent 3D observations.
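The frame change behind this step can be sketched in a few lines. The convention below is an assumption: (Δx, Δy, Δθ) is taken to be the pose of the target base expressed in the source base frame, so a world-fixed point p seen from the source frame becomes R(Δθ)ᵀ(p − t) in the target frame, with t = (Δx, Δy). The sketch covers only world-fixed scene points and end-effector waypoints; it does not regenerate the robot's own point cloud. Function names are hypothetical.

```python
import numpy as np


def to_target_frame(points_xyz, dx, dy, dtheta):
    """Re-express (N, 3) points from the source base frame in the target one.

    Applies the inverse planar rigid transform: p_t = R(dtheta)^T (p_s - t).
    The height (z) is unchanged because the base moves in the plane.
    """
    pts = np.asarray(points_xyz, dtype=float)
    c, s = np.cos(dtheta), np.sin(dtheta)
    rot = np.array([[c, -s], [s, c]])  # R(dtheta)
    # For row vectors, v @ R equals (R^T v)^T.
    xy = (pts[:, :2] - np.array([dx, dy])) @ rot
    return np.column_stack([xy, pts[:, 2]])


def skill_to_target_frame(ee_xyzyaw, dx, dy, dtheta):
    """Transform (N, 4) end-effector waypoints (x, y, z, yaw) the same way."""
    ee = np.asarray(ee_xyzyaw, dtype=float)
    xyz = to_target_frame(ee[:, :3], dx, dy, dtheta)
    yaw = (ee[:, 3] - dtheta + np.pi) % (2 * np.pi) - np.pi
    return np.column_stack([xyz, yaw])


# Target base sits at (1, 0) in the source frame, rotated by 90 degrees.
dx, dy, dtheta = 1.0, 0.0, np.pi / 2
scene = np.array([[1.0, 1.0, 0.5]])               # one scene point
waypoints = np.array([[1.0, 1.0, 0.5, np.pi / 2]])  # one end-effector waypoint
scene_t = to_target_frame(scene, dx, dy, dtheta)
waypoints_t = skill_to_target_frame(waypoints, dx, dy, dtheta)
```

Because the observation and the skill waypoints go through the same transform, the point that was 1 m ahead of the target base ends up at (1, 0, 0.5) in both, which is the observation-action consistency the augmentation relies on.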

Results

State-of-the-art on ManiSkill &
real-world generalization

DockAnywhere is evaluated on five ManiSkill manipulation tasks under 1-demo and 5-demo source settings.

78.9%

Avg. success (5-demo)
vs. 64.0% prior best

+14.9pp

Improvement over
best baseline (5-demo)

58.6%

Avg. success (1-demo)
strong single-demonstration performance

0.1s

Augmentation latency
per demonstration

ManiSkill Benchmark Comparison

Success rate (%) across five tasks. # Demos = number of source demonstrations used.

# Demos	Method	Pick Banana	Pick Mug	Place Can	Cabinet Door	Cabinet Drawer	Avg.
Demo #1
1	DP	95.0	82.0	76.0	68.0	72.0	78.6
1	DP3	100.0	88.0	90.0	81.0	84.0	88.6
Demo #5
5	DP	19.0	16.4	15.6	13.6	14.4	15.8
5	DP3	20.0	17.7	18.8	16.2	16.4	17.8
5	DP3+DemoGen	98.0	88.6	84.4	48.2	52.0	74.2
5	DockAnywhere (Ours)	97.0	89.4	87.2	60.2	60.6	78.9
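The Avg. column is the unweighted mean of the five per-task success rates. The short check below recomputes it for the last two rows of the table; the dictionary layout is only for illustration.

```python
# Per-task success rates (%) copied from the table above, in column order:
# Pick Banana, Pick Mug, Place Can, Cabinet Door, Cabinet Drawer.
rates = {
    "DP3+DemoGen": [98.0, 88.6, 84.4, 48.2, 52.0],
    "DockAnywhere (Ours)": [97.0, 89.4, 87.2, 60.2, 60.6],
}

# Unweighted mean over tasks, rounded to one decimal as in the table.
averages = {name: round(sum(v) / len(v), 1) for name, v in rates.items()}
for name, avg in averages.items():
    print(f"{name}: {avg}")
```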

Real-World Generalization

Real-world DockAnywhere: source demonstration on Galaxea R1, after docking transition, after rotation

Real-robot transfer on Galaxea R1. Given a source demonstration (left), DockAnywhere automatically computes the spatial transform for a new docking position (center) and applies rotation augmentation (right). The policy trained on augmented data successfully executes the gear assembly task from unseen docking configurations without additional human demonstrations.

Citation

BibTeX

If you find DockAnywhere useful in your research, please cite our paper.

@article{shan2026dockanywhere,
  title={DockAnywhere: Data-Efficient Visuomotor Policy Learning for Mobile Manipulation via Novel Demonstration Generation},
  author={Shan, Ziyu and Zhou, Yuheng and Wu, Gaoyuan and Ji, Ziheng and Wu, Zhenyu and Wang, Ziwei},
  journal={IEEE Robotics and Automation Letters},
  year={2026},
  publisher={IEEE}
}
