Abstract

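The work focuses on training models with reinforcement learning (RL) to align the outputs of generative models with desired behavior.
- The reward model is learned from human-annotated data
It identifies three common cases in which high rewards are wrongly assigned to undesirable patterns: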
- noise-induced spurious correlation,
- naturally occurring spurious correlation,
- covariate shift.
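It shows that even when the reward function achieves high performance on the data distribution used to train it, undesirable patterns can be amplified during RL training of the text generation model.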

1. Introduction

2. Related Work

3. Background

Conditional text generation systems usually model
- $p(y \mid x)$, where $x=(x_1,\ldots,x_{T_s})$ and $y=(y_1,\ldots,y_T)$
- $x$ is the source sequence, $y$ is the target sequence
- an autoregressive factorization
  - $\log p(y \mid x)=\sum_{t=1}^{T}\log p_\theta(y_t \mid y_{<t},x)$, where $y_{<t}=(y_1,\ldots,y_{t-1})$
$\theta$ denotes the parameters of $p_\theta$, which are determined by training.
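As a small illustration of the factorization above, the sketch below sums per-token log-probabilities to obtain the sequence log-probability. The function name `sequence_log_prob` and the probability values are hypothetical and chosen only for this example; a real model would produce $p_\theta(y_t \mid y_{<t}, x)$ from a network.

```python
import math

def sequence_log_prob(token_probs):
    """Return log p(y|x) as the sum over t of log p(y_t | y_<t, x)."""
    return sum(math.log(p) for p in token_probs)

# Hypothetical conditional probabilities for a 3-token target sequence.
token_probs = [0.5, 0.25, 0.8]
log_p = sequence_log_prob(token_probs)

# Exponentiating the sum recovers the product of the per-token probabilities.
joint = math.exp(log_p)
```

Working in log space keeps the computation numerically stable for long sequences, where the raw product of probabilities would underflow.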
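The generation process can be viewed as a sequential decision-making process, which makes it a natural fit for reinforcement learning.
Given the state $s_t=(x, y_{<t})$, the policy $\pi_\theta$ takes an action $a_t$.
- an action is a token in the vocabulary
The process then moves to the next state $s_{t+1}$ and receives a reward $r_t\in\mathbb{R}$.
- Here the reward model is a model learned from human annotations.
Assume the discount factor is $\gamma=1$.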
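The state, action, and reward loop described above can be sketched as a toy rollout. Everything here is an assumption made for illustration: `VOCAB`, `toy_policy`, `toy_reward`, and `rollout` are hypothetical names, the policy is uniform, and the reward is hand-written rather than learned from human annotations.

```python
import random

VOCAB = ["a", "b", "<eos>"]  # hypothetical toy vocabulary

def toy_policy(x, y_prefix):
    """Hypothetical policy pi(a | s): uniform over the vocabulary."""
    return [1.0 / len(VOCAB)] * len(VOCAB)

def toy_reward(x, y_prefix, action):
    """Hypothetical per-step reward r_t; a learned reward model would go here."""
    return 1.0 if action == "a" else 0.0

def rollout(x, max_len=5, seed=0):
    """Generate one trajectory and collect the per-step rewards."""
    rng = random.Random(seed)
    y, rewards = [], []
    for _ in range(max_len):
        probs = toy_policy(x, y)  # state s_t = (x, y_<t)
        action = rng.choices(VOCAB, weights=probs, k=1)[0]  # action a_t is a token
        rewards.append(toy_reward(x, y, action))
        y.append(action)  # appending the token yields the next state s_{t+1}
        if action == "<eos>":
            break
    return y, rewards

y, rewards = rollout(["src"])
episode_return = sum(rewards)  # with gamma = 1 the return is the plain sum
```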
To maximize the objective $J(\theta)=\mathbb{E}_{\tau\sim\pi_\theta}\left[R(x,y)\right]$, where $R(x,y)=\sum_{t=1}^{T}r_t$,
one way is to use a policy gradient method: REINFORCE
- $\nabla_\theta J(\theta)=\mathbb{E}_{\tau\sim\pi_\theta}\left[\sum_{t=1}^{T}\nabla_\theta \log \pi_\theta(a_t \mid s_t)\,\hat{Q}(s_t,a_t)\right]$, where $\hat{Q}(s_t,a_t)=\sum_{t'=t}^{T}r_{t'}$ is the estimated return.
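The REINFORCE estimate above can be sketched for a single sampled trajectory. This is a minimal sketch under stated assumptions, not the method used in the paper: the policy is a tabular softmax with one logit vector per time step (`logits[t][v]` is hypothetical), and the gradient is taken with respect to those logits.

```python
import math

def returns_to_go(rewards):
    """Q_hat(s_t, a_t) = sum of r_{t'} for t' >= t, with gamma = 1."""
    out, running = [], 0.0
    for r in reversed(rewards):
        running += r
        out.append(running)
    return out[::-1]

def softmax(logits):
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def reinforce_gradient(logits, actions, rewards):
    """Single-trajectory REINFORCE estimate for a tabular softmax policy.

    For a softmax, d log pi(a) / d logit_v = 1[v == a] - pi(v).
    """
    q_hat = returns_to_go(rewards)
    grad = []
    for t, a in enumerate(actions):
        probs = softmax(logits[t])
        grad.append([((1.0 if v == a else 0.0) - p) * q_hat[t]
                     for v, p in enumerate(probs)])
    return grad

# Two steps, two tokens, uniform initial policy (all values are illustrative).
grad = reinforce_gradient([[0.0, 0.0], [0.0, 0.0]], actions=[0, 1], rewards=[0.0, 1.0])
```

A gradient ascent step along `grad` raises the logits of the sampled tokens in proportion to the return that followed them, which is how tokens associated with high reward become more likely.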
The goal is to avoid reward performance collapse.
The following two points are orthogonal to each other.
- the choice of algorithm that makes generations achieve high rewards
- high rewards can correspond to undesirable generations
  - that is, choosing an algorithm that obtains the maximum reward vs. the maximum reward itself producing undesirable outputs
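The orthogonality of these two points can be illustrated with a toy example: even an optimizer that finds the exact maximum of a reward will return an undesirable output if the reward itself is flawed. Both `learned_reward` and `is_desirable` are hypothetical stand-ins invented for this sketch; the length bias is an assumed flaw, not a result from the paper.

```python
def learned_reward(y):
    """Hypothetical flawed reward that spuriously favors longer outputs."""
    return len(y.split())

def is_desirable(y):
    """Hypothetical ground-truth check (no repeated words), not seen during RL."""
    words = y.split()
    return len(set(words)) == len(words)

candidates = ["the cat sat", "the cat sat sat sat sat", "a cat"]

# A perfect optimizer over this candidate set: it always finds the maximum reward.
best = max(candidates, key=learned_reward)
gamed = not is_desirable(best)  # high reward, yet an undesirable generation
```

Swapping in a better optimization algorithm changes how reliably `best` is found, but not whether the highest-reward output is desirable; that depends only on the reward.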