My first week of imitation learning: ACT and Diffusion Policy

I just started as an intern at Dream Machines, a startup allowing non-technical people to automate repetitive work in factories by teaching robots through demonstration. A very exciting opportunity for me to learn a lot more about robot learning, a field which is still relatively new to me. I already know about reinforcement learning (RL) from courses I taught and from my bachelor’s thesis, where I developed an RL-based locomotion controller for the Serenity Robot. But I have never worked on manipulation tasks, never worked with imitation learning algorithms or Vision-Language-Action (VLA) models, and have only limited experience with policy deployment on real robots.

In this post I want to share how I spent my first few days of the internship getting into robot learning for manipulation and my experience training my first baseline imitation learning policies to stack cubes.

Getting warmed up

Robot learning combines machine learning and robotics to help robots learn new skills. As of right now (spring 2026), a lot of research on learning manipulation skills is done on multi-modal foundation models like VLAs and on fine-tuning policies with RL. Before looking at the absolute newest state-of-the-art research, I was advised to start with generally simpler imitation learning algorithms and build up my knowledge from there. Here are the first three papers I read:

Zhao et al. (2023), Learning Fine-Grained Bimanual Manipulation with Low-Cost Hardware. The Action Chunking Transformer (ACT) policy predicts actions in chunks instead of autoregressively one at a time.
Chi et al. (2023), Diffusion Policy: Visuomotor Policy Learning via Action Diffusion. The Diffusion Policy uses the same denoising idea as image-generation diffusion models, but to predict action sequences instead of pixels.
Black et al. (2024), π₀: A Vision-Language-Action Flow Model for General Robot Control. The first generalist policy published by Physical Intelligence.

Thanks to open-source platforms like LeRobot with implementations of these algorithms, you technically don’t even need to understand them in detail to use them. I still highly recommend at least informing yourself about the concepts, so that it doesn’t just feel like a black box you are working with. Another great resource I read is the Unfolding Robotics blog post from LeRobot, explaining the full process of training a policy to fold clothes.

I now felt ready to test out these algorithms myself. The goal of the next few days was to train baseline imitation learning policies on a simple task and get a feel for their abilities and limitations. The task I chose was stacking one cube on top of another with just a single robot arm. Fairly simple. I wanted to know:

Can ACT and Diffusion Policy generalise across cube placement?
Does adding wider-placement training data help performance?
How do ACT and Diffusion Policy compare on the same task?

Setup

The robot arms we use are two TRLC-DK1 arms from The Robot Learning Company with one camera per wrist and a context camera. Computation for training and inference is done on a local workstation with an NVIDIA RTX 5090.

Collecting Data

Time to leave the screen and touch some actual hardware. Following the advice from LeRobot’s blog, I started by teleoperating the two arms in our office to get a feel for them. I was quite clumsy in the beginning but after about ten minutes of lifting and placing different objects, my hands learned the mapping from the leader to the follower arms and it felt intuitive.

To collect data I would be using for training I tried to follow these tips I gathered from different sources:

Practice the task before recording. Imitation learning policies clone behaviour. Your demonstrations should therefore be clean.
Commit to one strategy. Even though ACT and Diffusion Policy can handle it if you solve the task with different strategies (e.g. grabbing from the top or from the side), they learn more efficiently with a single strategy.
Give the robot the needed context. Make sure the robot gets enough information from its sensors to actually be able to solve the task. Don’t block the cameras for example.

For my experiment I prepared three training datasets, with different distributions of cube placement:

name	demos	placement
`narrow`	100	cubes within a ~1 cube-side radius around fixed locations
`wide`	100	cubes within a ~2 cube-side radius
`combined`	200	union of `narrow` and `wide`

Narrow placement distribution — Cube starting positions across 5 trials. Left: narrow placement. Right: wide placement.

Wide placement distribution — Cube starting positions across 5 trials. Left: narrow placement. Right: wide placement.

ACT policy v1: It kind of works

I trained a first ACT policy on the 100 narrow demonstrations, with the default parameters we had in our codebase:

parameter	value
`chunk_size`	48
`n_action_steps`	30
`lr`	3e-5
`batch_size`	16
`steps`	10k

Honestly, I had low expectations. After 100 demonstrations and only 15 minutes of training, I expected nothing more than random jittery movements. Instead, the intent was clearly there: grasp the right cube, move to the left one, drop. Still lots of misses and somewhat jerky movements, but hey, that’s a start!

Initial act_v1 policy on the narrow dataset. Visibly rough, but the intent is clearly there.

ACT policy v2 and v3: Tony’s tips

The default parameters I used were clearly not optimal. I was pointed to Tony Zhao’s tuning tips for ACT which inspired the new parameter choices. Three takeaways I acted on:

chunk_size = n_action_steps. I had defaults of 48 and 30, which means the policy generates 48 actions per inference but only executes 30 before re-planning. Based on Tony’s tip, for act_v2 I set both to 50 (one second at our 50 Hz control rate).
Larger batch helps. With a batch size of 16, only ~25% VRAM was utilised on the GPU. For act_v3 I bumped batch size from 16 to 64 (now hitting ~86% VRAM) and adjusted the learning rate from 3e-5 to 5e-5.
Longer training. For act_v2 and act_v3 I increased the number of steps to 20k. Since intermediate checkpoints are saved, I could also run policies with fewer training steps if it turned out the policy overfits on the dataset.

In the end I had three ACT configurations:

config	`chunk_size` / `n_action_steps`	`batch_size`	`lr`	`steps`
`act_v1`	48 / 30	16	3e-5	10k
`act_v2`	50 / 50	16	3e-5	20k
`act_v3`	50 / 50	64	5e-5	20k

ACT training loss across the three datasets — Training L1 loss for the `act_v3` configuration on the three datasets.

Evaluation round 1: Too much at once

Once my policies had trained overnight, I needed to actually compare them. I was inspired by the scoring scheme from the π₀ paper: a task-progress score from 0.0 (failure) up to 1.0 (cubes stacked), with increments per subtask.

stage	score
1 failure	0.0
2 reach right cube	0.2
3 pick up cube	0.4
4 reach left cube	0.6
5 release cube	0.8
6 cubes stacked	1.0

In a first evaluation run I wanted to compare three things:

Number of training steps: Go through all ten checkpoints of act_v2 trained on the narrow dataset, from 2k to 20k steps.
Batch size: Compare act_v2 and act_v3 for the three different datasets.
Training dataset: Compare the policies trained on the three different datasets for configurations act_v2 and act_v3.

For each evaluation, half of the episodes would have narrow cube placements and the other half wide cube placements.

I completely underestimated the effort it takes to do proper evaluation. After all, it has to be done on real hardware and can’t be automated in simulation. I made the poor decision to still run all of the planned evaluations, but with only 10 episodes each. This led to very noisy results that make it hard to draw conclusions.

Training steps

Score distribution across act_v2 checkpoints on the narrow dataset — Score distribution of `act_v2` trained on the narrow dataset, across 10 checkpoints from 2k to 20k steps with 10 episodes per checkpoint.

Three interesting findings:

The subtask the robot struggles with the most through all the checkpoints is actually picking up the cube. The subtask most directly tied to the different cube placements.
Already after 6k training steps, the performance on paper is comparable to all the later checkpoints.
Performance degrades at the 20k checkpoint. The first checkpoint where some episodes score 0.0.

Even though the scores are similar throughout, qualitatively the robot’s motion got smoother with more training steps. Interestingly, the 20k checkpoint policy sometimes just got stuck in the initial position. This shows up as the episodes scoring 0.0. It is possible that the training data contained too many samples from the beginning of the episode, where the arm had not yet moved, and the policy overfit to these. To investigate this hypothesis, further experiments would be needed.

To continue with the evaluations i decided to use the 12k checkpoint for further comparisons.

Batch size and training datasets

Score distribution by dataset at 12k steps for act_v2 and act_v3 — Score distribution at the 12k-step checkpoint, comparing `act_v2` (left) and `act_v3` (right) across the three training datasets with 10 episodes per cell.

Two findings:

For configuration act_v3, the policy already gets stuck at the 12k checkpoint when trained with the combined dataset.
The policy trained only on the wide dataset performs worse than the other policies (apart from the combined/act_v3 run that got stuck).

With so little evaluation data it’s not possible to draw a conclusion about differences between act_v2 and act_v3. Observing scores of 0.0 for the policy trained with a higher batch size could still point to overfitting. Comparing the policies trained on the different datasets suggests that training on the wide dataset alone leads to worse performance than the other two datasets. This could just be because the wider distribution increases the difficulty, and the other two policies, which were (also) trained on narrow data, can compensate with high scores on the narrowly placed cubes during evaluation.

Evaluation round 2

For the second round I wanted to run a single, more conclusive comparison. I bumped the evaluation to 50 episodes per setting and only compared the checkpoint at 10k steps with the one at 20k steps for the act_v3 configuration trained on the combined dataset. Again, half of the evaluation episodes consisted of cubes placed in the narrow distribution and the other half in the wide distribution.

Score distribution at the 10k and 20k checkpoints, narrow vs wide positions — Score distribution across the 25 trials per cell, comparing the 10k checkpoint to the 20k checkpoint of `act_v3` on the `combined` dataset. Each coloured band is the share of trials that reached a given task-progress score.

The same configuration at 20k training steps achieves a much better score. On the narrow cube positions the policy achieves 84% strict success, and never fails earlier than the release stage (score 0.8). On the wide cube placements it succeeds 32% of the time. This result suggests that performance increases with more training steps and potentially even better performance could be achieved by training my policy for more than the 20k steps. This may seem to contradict round 1, where performance degraded at 20k, but the setting was slightly different (act_v2 on narrow). Interestingly, not a single episode achieved a score of 0.0.

act_v3 trained on the combined dataset, after 20k training steps.

What about Diffusion Policy?

For Diffusion Policy I used the configuration from the original paper’s “Real Pour / Spread / Mug Flip” tasks: DDIM with 16 inference steps, horizon=16, n_action_steps=8. I trained three policies, one per dataset, for 20k steps.

Diffusion Policy training loss across the three datasets — Training loss for the three Diffusion Policy runs.

None of the three policies had comparable performance to the ACT policies. All scores were approximately 0.30, and qualitatively the motion was visibly jerky, with the gripper sometimes hovering and backing off near the first cube before committing to a grasp.

I spent some time debugging and found that inference took almost 35 ms. But even after bringing down the inference time by doing fewer denoising steps, and giving the policy more time budget by running at 30 Hz instead of 50 Hz, I wasn’t able to improve the performance. In the videos below you can see the policy with different settings.

Left to right: 50 fps / 16 inference steps (baseline); 30 fps / 16 steps; 50 fps / 8 steps.

I sadly had to move on without seeing a working Diffusion Policy. Maybe some other time.

Learnings and Tips

Looking back at this week, there are a couple of things I would tell myself to watch out for before starting.

Start with behaviour cloning. I found ACT particularly rewarding, because it gives you results very quickly, which is motivating.
Train your first policy on an easy-to-teleoperate task. Being able to solve the task easily leads to high-quality demonstrations, speeds up data collection and prevents frustration.
Play around before doing proper experiments. Train a first policy before thinking about exactly what to analyse. By just observing the policy and trying to understand how it works and why it does some weird things, questions arise on their own and are probably more interesting than the ones you come up with right after reading.
Analyse one thing at a time. Don’t try to do everything at once; focus on one small thing to analyse. Plan out the experiment in detail and keep in mind that evaluations are time-consuming.

Here are some experiments I think would be interesting:

How important are the different camera angles? You could collect a single dataset and train multiple policies, each leaving out the images from a different camera, and compare their performance. If you choose a single-arm task, you could have the other arm follow along to provide an additional camera perspective. (In this blog you can see how the arms always follow each other.)
How do behaviour-cloning policies handle multi-modality? I think it would be interesting to solve a task with two different possible movements and observe how the policy handles it. Will it commit to one, or treat the two modes as equally likely?
How much can you boost a policy’s performance with HG-DAgger? If your setup allows for human-intervention data collection, you could do multiple iterations of training a policy and collecting intervention data, stepping in to help the policy reach the goal.
How long should a policy be trained? The results from my 2nd evaluation suggest that training the policy for more training steps increases the performance. It would be interesting to see when the performance plateaus or if it even degrades due to overfitting.
How do different lighting and background influence a policy? At the end of my week, we moved the desk with the robot arms and I felt like the performance of the policy dropped quite a bit. This would be interesting to examine more analytically.
Can you increase performance by weighting some sections of the task differently? Every task has sections that are more complex than others. Weighting the complex sections more (e.g. collecting additional data for them) could improve the policy’s performance. Removing the stale states at the beginning (and end) of the episode may also prevent the policy from getting stuck in the rest position.

Wrap-up

Going back to the three questions I started with:

Can ACT and Diffusion Policy generalise across cube placement? For ACT on the combined dataset, I saw this to some degree. 84% strict success on narrow positions and 32% on wide ones tell me the policy has learned something, but I haven’t tried placing cubes completely outside of the training distribution. Seeing the drop in successes from narrow to wide placement, I doubt I would see many successes there. For Diffusion Policy, I never got a working baseline, so I can’t say.
Does adding wider-placement training data help? Honestly, inconclusive. Evaluation round 1 was too noisy to read. The only thing my evaluation suggests is that training on the wide dataset alone was worse than narrow or combined.
How do ACT and Diffusion Policy compare? Same problem. Without a working Diffusion Policy, I can’t compare fairly.

Looking at this, one could think these experiments were a failure, but I’m quite happy with the results for now. The actual learning this week happened mostly around the experiments. The meta-lessons I listed above. But this week also fully caught my interest and motivates me to keep working. I can’t wait to start experimenting with newer algorithms and approaches to teach robots different skills. It’s an amazing time to be working in this field, and I can only encourage everyone to join!