REDS: Low Success Rate Despite Task Completion?

by Rajiv Sharma

Hey guys, I'm working on Reinforcement Learning from Demonstration (RLfD) and trying to reproduce the experiments from the paper "Subtask-Aware Visual Reward Learning from Segmented Demonstrations" (REDS). I've hit a snag and wanted to share my setup, the checks I've done so far, and hopefully get some insights from the community.

The Mission: Reproducing the Magic

My goal is to reproduce the experiments outlined in the paper, which center on visual reward learning. REDS runs in two stages, reward model training followed by policy training, and I'm working through them step by step.

Stage 1: Reward Model Training – Laying the Foundation

The first stage trains the reward model, which learns to score progress on the subtasks within the environment and later provides the reward signal for policy learning. I started with the following command:

bash scripts/train_reds_metaworld_step1.sh door-open 2 0 3000 100000 50 50 /root/autodl-tmp/

This trains the reward model for the door-open task in MetaWorld. The key arguments:

  • door-open: The MetaWorld task the reward model is trained for.
  • 2: Number of parallel environments used during training.
  • 0: Random seed, for reproducibility.
  • 3000: Number of training epochs, i.e. full passes through the demonstration dataset.
  • 100000: Total number of training steps.
  • 50 50: Number of positive (successful) and negative (failed) demonstrations used for training.
  • /root/autodl-tmp/: Base path where training logs and model checkpoints are written.
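
One thing worth checking right after this step: Stage 2 (below) expects the trained checkpoint at a specific location under the base path, so I confirm it actually landed there before starting policy training. The layout here is copied from the --reward_model_path in my Stage 2 command and may differ in other versions of the repo; a minimal check in Python:

# Confirm the Stage 1 checkpoint exists where Stage 2's --reward_model_path
# (see the command below) will look for it. The layout is copied from that
# command; adjust it if your version of the repo writes somewhere else.
from pathlib import Path

ckpt = Path("/root/autodl-tmp") / "reds_logdir/REDS/metaworld-door-open/door-open_phase2/s0/last_model.pkl"
print(ckpt, "exists:", ckpt.exists())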

Once trained, the reward model provides a signal that correlates with task progress, and the agent relies on that signal in the policy training stage.

Stage 2: Policy Training – Putting Knowledge into Action

With the reward model in place, the second stage trains the agent's policy with Dreamer, using the learned reward to guide it toward the task. The following command starts policy training:

DEVICE_ID=0
TASK_NAME=door-open
SEED=0
BASE_PATH=/root/autodl-tmp

XLA_PYTHON_CLIENT_PREALLOCATE=false LD_PRELOAD="" \
CUDA_VISIBLE_DEVICES=${DEVICE_ID} python scripts/train_dreamer.py \
    --configs=reds_prior_rb metaworld \
    --reward_model_path=${BASE_PATH}/reds_logdir/REDS/metaworld-${TASK_NAME}/${TASK_NAME}_phase2/s0/last_model.pkl \
    --logdir=${BASE_PATH}/exp_local/${TASK_NAME}_reds_seed${SEED} \
    --task=metaworld_${TASK_NAME} \
    --env.metaworld.reward_type=sparse \
    --seed=${SEED}

Let's break down the key components of this command:

  • DEVICE_ID=0: The GPU device used for training.
  • TASK_NAME=door-open: The same task as in Stage 1.
  • SEED=0: Random seed, for reproducibility.
  • BASE_PATH=/root/autodl-tmp: Base path, matching Stage 1, so the trained reward model can be found.
  • --configs=reds_prior_rb metaworld: Configuration presets for the agent architecture and training parameters.
  • --reward_model_path=${BASE_PATH}/reds_logdir/REDS/metaworld-${TASK_NAME}/${TASK_NAME}_phase2/s0/last_model.pkl: Path to the reward model trained in Stage 1, which the policy uses directly.
  • --logdir=${BASE_PATH}/exp_local/${TASK_NAME}_reds_seed${SEED}: Directory for policy training logs and checkpoints.
  • --task=metaworld_${TASK_NAME}: The environment/task to initialize.
  • --env.metaworld.reward_type=sparse: The environment's reward type; a sparse reward is given only when the task is completed successfully (see the sketch after this list).
  • --seed=${SEED}: Random seed again, for consistency across the run.
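
To make the sparse setting concrete, here is a minimal sketch of what a sparse MetaWorld reward usually means: the dense shaped reward is discarded and the environment's own success flag becomes the reward. This is illustrative only, written against the old gym-style step signature; it is not the repo's actual wrapper.

# Illustrative sparse-reward wrapper: return MetaWorld's success flag as the
# reward. Assumes the old gym 4-tuple step API; the REDS/Dreamer codebase may
# implement reward_type=sparse differently.
import gym

class SparseSuccessReward(gym.Wrapper):
    def step(self, action):
        obs, _dense_reward, done, info = self.env.step(action)
        sparse_reward = float(info.get("success", 0.0))  # 1.0 only when the env's success criterion fires
        return obs, sparse_reward, done, info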

This is where the agent turns the reward model's signal into actual behavior in the environment.

The Puzzle: Success Eluding the Agent

After training, I checked the final evaluation logs (e.g., 20250805T210535_249_e0bd_failure.npz), and the success rate is stuck at 0. This is puzzling, because the rollout videos show the door actually opening, which looks like task completion.
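
To see exactly what an evaluation episode records, I've been inspecting the saved .npz directly. The key names checked below are guesses (print ep.files to see what your build actually logs); the point is to find out whether any success-like signal ever rises above zero.

# Inspect one saved evaluation episode. Key names vary between Dreamer builds,
# so list them first, then look at anything reward- or success-related.
import numpy as np

ep = np.load("20250805T210535_249_e0bd_failure.npz")
print(ep.files)

for key in ep.files:
    arr = ep[key]
    if "success" in key or "reward" in key:
        print(key, arr.shape, "max =", float(arr.max()), "sum =", float(arr.sum()))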

My initial thought was that the agent might be falling just short of the success threshold defined inside the environment: the behavior looks right, but the check that decides success never fires.
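
To understand what that threshold actually is, I've also been probing the environment's own success signal outside of Dreamer. The snippet below assumes the classic MetaWorld MT1 API and the old gym-style step return (newer gymnasium-based builds return an extra value from reset and step); replacing the random action with actions replayed from a "visually successful" rollout, or with a trained policy, shows how close it gets to triggering info["success"].

# Probe MetaWorld's built-in success criterion for door-open, independent of
# the learned reward. Assumes the classic MT1 API and the gym 4-tuple step;
# adjust for newer gymnasium-based versions of metaworld.
import metaworld

mt1 = metaworld.MT1("door-open-v2")
env = mt1.train_classes["door-open-v2"]()
env.set_task(mt1.train_tasks[0])

obs = env.reset()
for t in range(500):
    # Replace the random action with replayed or policy actions for a real test.
    obs, reward, done, info = env.step(env.action_space.sample())
    if info.get("success", 0.0) > 0.5:
        print(f"success flagged at step {t}, obj_to_target={info.get('obj_to_target')}")
        break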

Deep Dive into the Investigation

Here's what I've checked so far:

  • Reward model verification: I confirmed that reward_model_path points to the trained model and that the model loads without errors (a minimal version of this check is sketched after this list).
  • Consistency checks: camera_keys, window_size, and skip_frame are identical between Stage 1 (reward model training) and Stage 2 (policy training), since a mismatch would distort what the reward model sees.
  • Negative demonstration tuning: I started with 50 negative demonstrations (NUM_FAILURE_DEMOS), then tried reducing the count and disabling them entirely, to see how negative examples affect the reward model.
  • Sparse reward confirmation: I verified that the Dreamer evaluation is using the environment's sparse success criterion.
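
For reference, the reward-model check from the first bullet was roughly the following. It has to run inside the same environment as the REDS code so any custom classes in the pickle can be resolved; since I don't know exactly what object the repo serializes, I just inspect whatever comes out.

# Load the Stage 1 checkpoint and look at what was serialized. Run this in the
# same Python environment as the REDS code so pickled custom classes import.
import pickle
from pathlib import Path

ckpt = Path("/root/autodl-tmp/reds_logdir/REDS/metaworld-door-open/door-open_phase2/s0/last_model.pkl")
with ckpt.open("rb") as f:
    reward_model = pickle.load(f)

print(type(reward_model))
print([name for name in dir(reward_model) if not name.startswith("_")])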

Despite all of these checks, the success rate stays at 0, so I'm turning to the community.

The Question: Seeking Wisdom from the Masters

So, my question: are there recommended methods or parameter settings in the implementation that would let this task be recognized as successful during evaluation, so that the reported success rate matches what the rollouts show?

Any insights, suggestions, or educated guesses, especially from anyone who has hit similar issues in RLfD, would be really appreciated.

I'm particularly interested in the sparse reward setting. The agent appears to complete the task visually while the success metric stays at zero, which suggests a disconnect between the learned behavior and the environment's success criterion.

Could it be a matter of fine-tuning the reward threshold? Are there specific parameters within the Dreamer implementation that govern the sensitivity of the success detection mechanism? Or perhaps the issue lies in the way the sparse reward is being propagated through the learning process?

I'm open to trying different settings and will report back with results.

Thanks in advance for your time!