Train Natural Language Dialogue Task

In this section, we use a natural language dialogue task (DailyDialog) to show how to import Hugging Face models and datasets into OpenRL, how to use a custom reward model, and how to customize the wandb output.

Introduction to DailyDialog

DailyDialog is a multi-turn dialogue dataset containing roughly 13,000 dialogues.

The following image shows an example dialogue from DailyDialog:

In the past, supervised learning was commonly used for training natural language tasks, but recent research has shown that reinforcement learning can also be used to train language models, and can significantly improve model performance (see [1][2][3]).

Next, we will provide a detailed introduction on how to use OpenRL to train natural language tasks.

Create Environment and Load Dataset

Training for natural language tasks involves the use of some additional packages. Users can install these packages using the following command:

pip install "openrl[nlp]"

Similar to the tutorial on multi-agent RL (MPE), we first need to write a train_ppo.py file containing the following training code:

# train_ppo.py
from openrl.envs.common import make
from openrl.modules.common import PPONet as Net
from openrl.runners.common import PPOAgent as Agent
from openrl.configs.config import create_config_parser


def train():
    # Read the configuration file passed via --config.
    cfg_parser = create_config_parser()
    cfg = cfg_parser.parse_args()
    # Create the NLP environment with two parallel environments.
    env = make("daily_dialog", env_num=2, asynchronous=True, cfg=cfg)
    # Create the neural network.
    net = Net(env, cfg=cfg, device="cuda")
    # Create the training agent.
    agent = Agent(net)
    # Begin training.
    agent.train(total_time_steps=100000)
    # Save the trained agent.
    agent.save("./ppo_agent/")


if __name__ == "__main__":
    train()

Then, we can create a configuration file named nlp_ppo.yaml and add the following content:

# nlp_ppo.yaml
env: # environment parameters
    args: {
        "tokenizer_path": gpt2, # path to load tokenizer
        "data_path": daily_dialog # dataset name or path
    }
seed: 0 # set seed
lr: 1e-6 # set learning rate of policy model
critic_lr: 1e-6 # set learning rate of critic model
episode_length: 20 # set the length of each episode
use_recurrent_policy: true

As the configuration file above shows, training an NLP task requires additional settings under the environment parameter env.args. The tokenizer_path parameter specifies the Hugging Face name or local path of the tokenizer to load. The data_path parameter can be set to either a Hugging Face dataset name or a local dataset path.

Train with Hugging Face Models

In OpenRL, we can use models from Hugging Face for training. To load a model from Hugging Face, first add the following content to the configuration file nlp_ppo.yaml:

# nlp_ppo.yaml
model_path: rajkumarrrk/gpt2-fine-tuned-on-daily-dialog # pre-trained model name or path
use_share_model: true
ppo_epoch: 5 # ppo iteration times

env: # environment parameters
    args: {
        "tokenizer_path": gpt2, # path to load tokenizer
        "data_path": daily_dialog # dataset name or path
    }
lr: 1e-6 # set learning rate of policy model
critic_lr: 1e-6 # set learning rate of critic model
episode_length: 128 # set the length of each episode
num_mini_batch: 20

Then add the following code to train_ppo.py:

# train_ppo.py
from openrl.envs.common import make
from openrl.modules.common import PPONet as Net
from openrl.runners.common import PPOAgent as Agent
from openrl.configs.config import create_config_parser
from openrl.modules.networks.policy_value_network_gpt import (
    PolicyValueNetworkGPT as PolicyValueNetwork,
)


def train():
    # Read the configuration file passed via --config.
    cfg_parser = create_config_parser()
    cfg = cfg_parser.parse_args()
    # Create the NLP environment with two parallel environments.
    env = make("daily_dialog", env_num=2, asynchronous=True, cfg=cfg)
    # Create the neural network from the Hugging Face-based policy/value model.
    model_dict = {"model": PolicyValueNetwork}
    net = Net(env, cfg=cfg, model_dict=model_dict)
    # Create the training agent.
    agent = Agent(net)
    # Begin training.
    agent.train(total_time_steps=100000)
    # Save the trained agent.
    agent.save("./ppo_agent/")


if __name__ == "__main__":
    train()

By making the simple modifications outlined above, users can train using pre-trained models on Hugging Face.

Note

In the above example, we used the model PolicyValueNetworkGPT. OpenRL also supports user-defined models (such as a custom model named CustomPolicyValueNetwork), which can be passed to the training network as follows:

model_dict = {"model": CustomPolicyValueNetwork}
net = Net(env, model_dict=model_dict)

To implement the policy network and the value network separately, pass them under the policy and critic keys:

model_dict = {
    "policy": CustomPolicyNetwork,
    "critic": CustomValueNetwork,
}
net = Net(env, model_dict=model_dict)

For how to implement custom models, refer to PolicyValueNetworkGPT, PolicyNetwork, and ValueNetwork.

Use Reward Model

Usually, the datasets for natural language tasks do not include reward information. Therefore, if reinforcement learning is used to train a natural language task, an additional reward model needs to be used to generate rewards.

In this DailyDialog task, we will use a composite reward model that includes the following three parts:

Intent Reward: When the text generated by the agent is close to the expected intent, the agent receives a higher reward.
METEOR Metric Reward: METEOR is a metric for evaluating text generation quality that measures how similar the generated text is to the expected text. We feed this metric back to the agent as a reward to optimize its text generation performance.
KL Divergence Reward: This reward limits how far the text generated by the agent deviates from the pre-trained model, which helps prevent reward hacking.

Our final reward is a weighted sum of these three rewards, where the coefficient of the KL Divergence Reward changes dynamically based on the value of the KL divergence.
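The weighted sum with a dynamic KL coefficient can be illustrated with a minimal sketch. This is not OpenRL's NLPReward implementation: the class AdaptiveKLController, the function composite_reward, the weights, the target KL, and the clipped proportional update rule are all assumptions chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class AdaptiveKLController:
    """Hypothetical proportional controller for the KL coefficient."""

    coef: float = 0.2        # initial KL coefficient (assumed value)
    target_kl: float = 0.05  # desired KL divergence (assumed value)
    horizon: int = 10_000    # smoothing horizon in steps (assumed value)

    def update(self, observed_kl: float, n_steps: int) -> float:
        # Clip the proportional error so one noisy batch cannot swing the coefficient.
        error = max(-0.2, min(0.2, observed_kl / self.target_kl - 1.0))
        # Raise the coefficient when KL is above target, lower it when below.
        self.coef *= 1.0 + error * n_steps / self.horizon
        return self.coef


def composite_reward(intent, meteor, kl, kl_coef,
                     intent_weight=0.75, meteor_weight=0.25):
    """Weighted sum of the two task rewards minus a KL penalty (assumed weights)."""
    return intent_weight * intent + meteor_weight * meteor - kl_coef * kl


controller = AdaptiveKLController()
reward = composite_reward(intent=1.0, meteor=0.3, kl=0.1, kl_coef=controller.coef)
new_coef = controller.update(observed_kl=0.1, n_steps=256)
print(f"reward={reward:.3f}, updated kl_coef={new_coef:.6f}")
```

In this sketch the observed KL (0.1) is above the assumed target (0.05), so the coefficient grows slightly and later deviations from the reference model are penalized more strongly.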

To use this reward model in OpenRL, users do not need to modify the training code; they only need to add the reward_class parameter to the nlp_ppo.yaml file:

# nlp_ppo.yaml
reward_class:
    id: NLPReward # reward model name
    args: {
        # The name or path of the model used for intent recognition.
        "intent_model": rajkumarrrk/roberta-daily-dialog-intent-classifier,
        # The name or path of the pre-trained model used for calculating KL divergence.
        "ref_model": rajkumarrrk/gpt2-fine-tuned-on-daily-dialog,
    }

model_path: rajkumarrrk/gpt2-fine-tuned-on-daily-dialog # pre-trained model name or path
use_share_model: true
ppo_epoch: 5 # ppo iteration times
env: # environment parameters
    args: {
        "tokenizer_path": gpt2, # path to load tokenizer
        "data_path": daily_dialog # dataset name or path
    }
lr: 1e-6 # set learning rate of policy model
critic_lr: 1e-6 # set learning rate of critic model
episode_length: 128 # set the length of each episode
num_mini_batch: 20

Note

OpenRL also supports custom reward models. First, write a custom reward model that inherits from the BaseReward class. Then register it by adding the following code to train_ppo.py:

# train_ppo.py
from openrl.rewards.nlp_reward import CustomReward
from openrl.rewards import RewardFactory
RewardFactory.register("CustomReward", CustomReward)

Finally, reference the custom reward model in nlp_ppo.yaml:

reward_class:
    id: "CustomReward" # custom reward model name
    args: {} # the parameters that may be used in the custom reward model
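The register-then-reference-by-id flow above follows a common registry pattern, sketched below in plain Python. RewardRegistry and LengthPenaltyReward are hypothetical names invented for this illustration; they are not part of OpenRL, and a real custom reward must inherit from BaseReward and follow its interface.

```python
class RewardRegistry:
    """Hypothetical name-to-class registry, similar in spirit to a factory."""

    def __init__(self):
        self._classes = {}

    def register(self, name, reward_class):
        # Refuse duplicate names so a typo cannot silently replace a reward.
        if name in self._classes:
            raise ValueError(f"reward {name!r} is already registered")
        self._classes[name] = reward_class

    def create(self, name, **kwargs):
        # Look up the class by the id given in the config and pass it the args.
        if name not in self._classes:
            raise KeyError(f"unknown reward {name!r}")
        return self._classes[name](**kwargs)


class LengthPenaltyReward:
    """Toy reward: penalize each word beyond max_tokens."""

    def __init__(self, max_tokens=20):
        self.max_tokens = max_tokens

    def score(self, response):
        excess = max(0, len(response.split()) - self.max_tokens)
        return -0.1 * excess


registry = RewardRegistry()
registry.register("LengthPenaltyReward", LengthPenaltyReward)
# The id and args play the role of reward_class.id and reward_class.args.
reward = registry.create("LengthPenaltyReward", max_tokens=3)
print(reward.score("this reply has five words"))
```

Registering by name keeps the training script unchanged when the reward is swapped: only the id and args in the configuration file need to change.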

Customize wandb Output

OpenRL also supports user-defined output content for wandb or tensorboard. For example, during the training process of this task, we also need to output information on various types of rewards and KL divergence coefficients. Users can add the vec_info_class parameter in the nlp_ppo.yaml file to achieve this:

# nlp_ppo.yaml
vec_info_class:
    id: "NLPVecInfo" # Call the NLPVecInfo class to print information about rewards in the NLP task.
# set wandb information
wandb_entity: openrl # This is used to specify the wandb team name. Please replace openrl with your own team name.
experiment_name: train_nlp # This is used to specify the name of the experiment.
run_dir: ./run_results/ # This is used to specify the path for saving experimental data.
log_interval: 1 # Upload data to wandb every log_interval episodes.
# Fill in other parameters yourself...

After modifying the configuration file, enable wandb in the train_ppo.py file:

# train_ppo.py
agent.train(total_time_steps=100000, use_wandb=True)

Then run python train_ppo.py --config nlp_ppo.yaml. After a while, you can see the following output in wandb:

From the above figure, we can see that wandb has outputted information on various types of rewards and the KL divergence coefficient.

To output other information, users can refer to the NLPVecInfo and VecInfo classes to implement their own CustomVecInfo class, and then register it in train_ppo.py:

# train_ppo.py
# Register the CustomVecInfo class.
VecInfoFactory.register("CustomVecInfo", CustomVecInfo)

Finally, just fill in the CustomVecInfo class in the nlp_ppo.yaml file:

# nlp_ppo.yaml
vec_info_class:
    id: "CustomVecInfo" # Call the CustomVecInfo class to output custom information.

Accelerate Training with Automatic Mixed Precision

OpenRL also provides a feature to enable automatic mixed-precision training in one step. Users only need to add the following parameters in the configuration file:

# nlp_ppo.yaml
use_amp: true # Enable automatic mixed precision training.

Tip

Users can find sample code for training NLP tasks in train_ppo.py and the corresponding parameters in nlp_ppo.yaml. Run python train_ppo.py --config nlp_ppo.yaml to train the dialogue task.

Accelerate Training with DeepSpeed

OpenRL also provides a feature to enable DeepSpeed training in one step. Users first need to add two configuration files:

# ds_config.json
{
  "train_batch_size": 32, # train_batch_size = episode_length * env_num / num_mini_batch
  "train_micro_batch_size_per_gpu": 16, # train_micro_batch_size_per_gpu = train_batch_size / num_gpu
  "steps_per_print": 10,
  "zero_optimization": {
    "stage": 2, # use ZeRO stage 2 by default
    "reduce_bucket_size": 5e7,
    "allgather_bucket_size": 5e7
  },
  "fp16": {"enabled": false, "loss_scale_window": 100} # whether to use fp16
}

# eval_ds_config.json
{
  "train_batch_size": 32,
  "train_micro_batch_size_per_gpu": 16,
  "steps_per_print": 10,
  "zero_optimization": {
    "stage": 0, # default to use cpu offload for ref_model and reward model
    "offload_param": {"device": "cpu"}
  },
  "fp16": {"enabled": false} # whether to use fp16
}

Next, enable DeepSpeed in nlp_ppo_ds.yaml:

use_deepspeed: true
use_fp16: false
use_offload: false
deepspeed_config: ds_config.json
reward_class:
  id: "NLPReward"
  args: {
    "use_deepspeed": true,
    "ref_ds_config": "eval_ds_config.json", # use eval ds config for ref model
    "ref_model": "rajkumarrrk/gpt2-fine-tuned-on-daily-dialog",
    "intent_ds_config": "eval_ds_config.json", # use eval ds config for reward model
    "intent_model": "rajkumarrrk/roberta-daily-dialog-intent-classifier",
  }

Tip

episode_length and num_mini_batch can be found in nlp_ppo_ds.yaml; env_num can be found in train_ppo.py. Please ensure that these parameters satisfy the following relationship: train_batch_size = episode_length * env_num / num_mini_batch.
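The relationship can be checked before launching a run with a small helper like the sketch below. The function deepspeed_batch_sizes is a hypothetical helper, not part of OpenRL or DeepSpeed. The values num_mini_batch=8 and num_gpu=2 are assumptions chosen because, together with episode_length=128 and env_num=2, they reproduce the 32 and 16 used in the configuration files above.

```python
def deepspeed_batch_sizes(episode_length, env_num, num_mini_batch, num_gpu):
    """Return (train_batch_size, train_micro_batch_size_per_gpu).

    Raises ValueError when the settings do not divide evenly, since batch
    sizes must be whole numbers.
    """
    total_steps = episode_length * env_num
    if total_steps % num_mini_batch != 0:
        raise ValueError(
            f"episode_length * env_num = {total_steps} is not divisible "
            f"by num_mini_batch = {num_mini_batch}"
        )
    train_batch_size = total_steps // num_mini_batch
    if train_batch_size % num_gpu != 0:
        raise ValueError(
            f"train_batch_size = {train_batch_size} is not divisible "
            f"by num_gpu = {num_gpu}"
        )
    return train_batch_size, train_batch_size // num_gpu


# Assumed example values: 128 * 2 / 8 = 32, and 32 / 2 GPUs = 16.
print(deepspeed_batch_sizes(episode_length=128, env_num=2, num_mini_batch=8, num_gpu=2))
```

With episode_length=128 and env_num=2, a setting such as num_mini_batch=20 would not divide evenly, so the helper raises an error instead of returning a fractional batch size.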

Finally, please run the command to start training:

deepspeed train_ppo.py --config nlp_ppo_ds.yaml

Training Results of OpenRL

The table below shows the results of training the dialogue task with OpenRL. After training with reinforcement learning, OpenRL improves on the supervised baseline for every quality metric. Compared to RL4LMs, OpenRL also trains faster (17.2% higher FPS on the same server with NVIDIA 3090 GPUs) and scores higher on most metrics. In the OpenRL row, the FPS percentage is relative to RL4LMs, and the other percentages are relative to Supervised Learning:

	FPS(Speed)	Rouge-1	Rouge-Lsum	Meteor	SacreBLEU	Intent Reward	Mean Output Length
Supervised Learning	None	0.164	0.137	0.234	0.063	0.427	18.95
RL4LMs	11.26	0.169	0.144	0.198	0.071	0.455	18.83
OpenRL	13.20(+17%)	0.181(+10%)	0.153(+12%)	0.292(+25%)	0.090(+43%)	0.435(+1.9%)	18.69
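The percentages in the OpenRL row can be reproduced from the raw numbers in the table. The short sketch below recomputes them; relative_gain is a hypothetical helper, and the choice of baselines (RL4LMs for FPS, Supervised Learning for the other metrics) is inferred from the arithmetic rather than stated by the table itself.

```python
def relative_gain(new, baseline):
    """Relative improvement of new over baseline, in percent."""
    return (new - baseline) / baseline * 100.0


# Values copied from the table above.
supervised = {"Rouge-1": 0.164, "Rouge-Lsum": 0.137, "Meteor": 0.234,
              "SacreBLEU": 0.063, "Intent Reward": 0.427}
openrl = {"Rouge-1": 0.181, "Rouge-Lsum": 0.153, "Meteor": 0.292,
          "SacreBLEU": 0.090, "Intent Reward": 0.435}

# Quality metrics are compared against the supervised baseline.
for metric, baseline in supervised.items():
    print(f"{metric}: {relative_gain(openrl[metric], baseline):+.1f}%")

# Training speed is compared against RL4LMs (13.20 vs 11.26 FPS).
fps_gain = relative_gain(13.20, 11.26)
print(f"FPS vs RL4LMs: {fps_gain:+.1f}%")
```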

The table below shows that compared to OpenRL with Data-Parallel, OpenRL with DeepSpeed has a faster training speed:

	FPS(Speed)	Number of GPUs	Memory Usage per GPU(MB)	GPU Type	Train Micro Batch Size per GPU
DeepSpeed w/ GPT-2-small	5.11(+30%)	2	13537	RTX 3090	8
Data-Parallel w/ GPT-2-small	3.94	2	7207	RTX 3090	8
DeepSpeed w/ OPT-1.3B	7.09(+35%)	4	35360	NVIDIA A100	8
Data-Parallel w/ OPT-1.3B	5.25	4	15854	NVIDIA A100	8

Chat with Trained Agent

For a trained agent, users can easily engage in conversation through the agent.chat() function:

# chat.py
from openrl.runners.common import ChatAgent as Agent


def chat():
    # Load the agent saved by train_ppo.py together with its tokenizer.
    agent = Agent.load("./ppo_agent", tokenizer="gpt2")
    history = []
    print("Welcome to OpenRL!")
    while True:
        input_text = input("> User: ")
        if input_text == "quit":
            break
        elif input_text == "reset":
            # Clear the dialogue context and start over.
            history = []
            print("Welcome to OpenRL!")
            continue
        response = agent.chat(input_text, history)
        print(f"> OpenRL Agent: {response}")
        # Keep the history as alternating user and agent turns.
        history.append(input_text)
        history.append(response)


if __name__ == "__main__":
    chat()
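The loop logic in chat.py (the quit and reset commands and the alternating history list) can be exercised without a trained model, as in the sketch below. EchoAgent and run_dialogue are hypothetical stand-ins written for this illustration; only the chat(input_text, history) call shape is taken from the script above.

```python
class EchoAgent:
    """Hypothetical stand-in for a trained agent; it only echoes the input."""

    def chat(self, input_text, history):
        # Two history entries (user turn, agent turn) are stored per exchange.
        turn = len(history) // 2 + 1
        return f"(turn {turn}) you said: {input_text}"


def run_dialogue(agent, user_inputs):
    """Replay the chat loop over scripted inputs and return the final history."""
    history = []
    for input_text in user_inputs:
        if input_text == "quit":
            break
        if input_text == "reset":
            # Drop the dialogue context, as the interactive loop does.
            history = []
            continue
        response = agent.chat(input_text, history)
        history.append(input_text)
        history.append(response)
    return history


history = run_dialogue(EchoAgent(), ["hello", "how are you?", "quit", "ignored"])
for line in history:
    print(line)
```

Scripting the inputs this way makes it easy to confirm that reset clears the context and that nothing after quit reaches the agent.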

Execute python chat.py to start a conversation with the trained agent:

Tip

Users can find the sample code for this section in chat.py . In addition, we also provide an example of chatting with the ChatGLM-6B model in chat_6b.py.