Utils

Functions:

calculate_advantage(row)

Calculate advantage values for a row of data.

Parameters:

  • row (dict) –

    Dictionary containing rewards and statistics with keys:

    • rewards: List of reward values
    • reward_mean: Mean reward value
    • reward_std: Standard deviation of rewards

Returns:

  • list[float]

    List of advantage values calculated as (reward - mean) / (std + eps), where eps = 1e-4 is added for numerical stability (a NaN std is first mapped to 0 via np.nan_to_num)

Source code in tapeagents/finetune/rl/utils.py
def calculate_advantage(row):
    """
    Calculate advantage values for a row of data.

    Args:
        row (dict): Dictionary containing rewards and statistics with keys:

            - rewards: List of reward values
            - reward_mean: Mean reward value
            - reward_std: Standard deviation of rewards

    Returns:
        (list[float]): List of advantage values calculated as (reward - mean) / (std + eps),
            where eps=1e-4 is added for numerical stability
    """
    rewards = row["rewards"]
    mean = row["reward_mean"]
    std = row["reward_std"]
    return [(reward - mean) / (np.nan_to_num(std) + 1e-4) for reward in rewards]
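
A minimal usage sketch; the reward values and statistics below are made up for illustration:

import numpy as np

row = {
    "rewards": [1.0, 0.0, 0.5],   # hypothetical sampled rewards
    "reward_mean": 0.5,           # hypothetical precomputed group mean
    "reward_std": 0.5,            # hypothetical precomputed group std
}
calculate_advantage(row)  # ≈ [0.9998, -0.9998, 0.0]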

calculate_reward_with_implicit_kl(row, reward_minus_kl_coef)

Calculate reward with implicit KL penalty.

Parameters:

  • row (dict) –

    Dictionary containing reward and log probability data with keys:

    • reward: Base reward value
    • old_logprobs: Log probabilities from old policy
    • ref_logprobs: Reference log probabilities
  • reward_minus_kl_coef (float) –

    Coefficient for implicit KL penalty term

Returns:

  • float

    Reward value adjusted by the implicit KL penalty, calculated as reward - reward_minus_kl_coef * KL(ref||old). The KL divergence is approximated with the Schulman estimator: KL ≈ exp(log_ratio) - log_ratio - 1, where log_ratio = ref_logprobs - old_logprobs.

Source code in tapeagents/finetune/rl/utils.py
def calculate_reward_with_implicit_kl(row, reward_minus_kl_coef):
    """
    Calculate reward with implicit KL penalty.

    Args:
        row (dict): Dictionary containing reward and log probability data with keys:

            - reward: Base reward value
            - old_logprobs: Log probabilities from old policy
            - ref_logprobs: Reference log probabilities
        reward_minus_kl_coef (float): Coefficient for implicit KL penalty term

    Returns:
        (float): Reward value adjusted by implicit KL penalty, calculated as:
            reward - reward_minus_kl_coef * KL(ref||old)
            The KL divergence is approximated using the Schulman approximation:
            KL ≈ exp(log_ratio) - log_ratio - 1
            where log_ratio = ref_logprobs - old_logprobs
    """
    reward = row["reward"]
    old_logprobs = row["old_logprobs"]
    ref_logprobs = row["ref_logprobs"]
    log_ratio_ref_old = ref_logprobs - old_logprobs
    kl = (np.exp(log_ratio_ref_old) - log_ratio_ref_old - 1).sum()  # Schulman KL approx
    return reward - reward_minus_kl_coef * kl
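
A short sketch with hypothetical per-token log-probabilities, showing how the Schulman estimator shrinks the reward:

import numpy as np

row = {
    "reward": 1.0,
    "old_logprobs": np.array([-0.5, -1.0, -0.2]),  # hypothetical values
    "ref_logprobs": np.array([-0.6, -0.9, -0.3]),  # hypothetical values
}
# log_ratio = [-0.1, 0.1, -0.1], so KL ≈ sum(exp(d) - d - 1) ≈ 0.0148
calculate_reward_with_implicit_kl(row, reward_minus_kl_coef=0.1)  # ≈ 0.9985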

masked_mean(values, mask, axis=None)

Compute the mean of tensor elements selected by a mask.

Source code in tapeagents/finetune/rl/utils.py
def masked_mean(values: torch.Tensor, mask: torch.Tensor, axis: Optional[int] = None) -> torch.Tensor:
    """Compute the mean of tensor elements selected by a mask."""
    if axis is not None:
        return (values * mask).sum(axis=axis) / mask.sum(axis=axis)  # type: ignore
    else:
        return (values * mask).sum() / mask.sum()
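
For illustration, with hypothetical tensors where the mask zeroes out the last two positions:

import torch

values = torch.tensor([2.0, 4.0, 6.0, 8.0])
mask = torch.tensor([1.0, 1.0, 0.0, 0.0])
masked_mean(values, mask)  # (2 + 4) / 2 -> tensor(3.)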

masked_sum(values, mask, axis=None)

Compute the sum of tensor elements selected by a mask.

Source code in tapeagents/finetune/rl/utils.py
def masked_sum(values: torch.Tensor, mask: torch.Tensor, axis: Optional[int] = None) -> torch.Tensor:
    """Compute the sum of tensor elements selected by a mask."""
    if axis is not None:
        return (values * mask).sum(axis=axis)  # type: ignore
    else:
        return (values * mask).sum()
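
With an axis argument the sum is taken per row; hypothetical tensors again:

import torch

values = torch.tensor([[1.0, 2.0], [3.0, 4.0]])
mask = torch.tensor([[1.0, 0.0], [1.0, 1.0]])
masked_sum(values, mask, axis=1)  # tensor([1., 7.])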

replace_dataset_column(dataset, column_name, new_column)

Replace a column in the dataset with a new column.

Source code in tapeagents/finetune/rl/utils.py
def replace_dataset_column(dataset: Dataset, column_name: str, new_column: List[List[float]]) -> Dataset:
    """
    Replace a column in the dataset with a new column.
    """
    if column_name in dataset.features:
        dataset = dataset.map(remove_columns=[column_name])
    dataset = dataset.add_column(name=column_name, column=new_column)  # type: ignore

    return dataset
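
A usage sketch on a toy Hugging Face dataset; the column name and values are illustrative:

from datasets import Dataset

ds = Dataset.from_dict({"text": ["a", "b"], "advantages": [[0.0], [0.0]]})
ds = replace_dataset_column(ds, "advantages", [[0.5], [-0.5]])
print(ds["advantages"])  # [[0.5], [-0.5]]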