We support both the noise prediction model ("predicting epsilon") and the data prediction model ("predicting x0").
If `predict_x0` is False, we use the solver for the noise prediction model (DPM-Solver).
If `predict_x0` is True, we use the solver for the data prediction model (DPM-Solver++).
In such case, we further support the "dynamic thresholding" in [1] when `thresholding` is True.
The "dynamic thresholding" can greatly improve the sample quality for pixel-space DPMs with large guidance scales.
Args:
model_fn: A noise prediction model function which accepts the continuous-time input (t in [epsilon, T]):
``
def model_fn(x, t_continuous):
return noise
``
noise_schedule: A noise schedule object, such as NoiseScheduleVP.
predict_x0: A `bool`. If true, use the data prediction model; else, use the noise prediction model.
thresholding: A `bool`. Valid when `predict_x0` is True. Whether to use the "dynamic thresholding" in [1].
max_val: A `float`. Valid when both `predict_x0` and `thresholding` are True. The max value for thresholding.
[1] Chitwan Saharia, William Chan, Saurabh Saxena, Lala Li, Jay Whang, Emily Denton, Seyed Kamyar Seyed Ghasemipour, Burcu Karagol Ayan, S Sara Mahdavi, Rapha Gontijo Lopes, et al. Photorealistic text-to-image diffusion models with deep language understanding. arXiv preprint arXiv:2205.11487, 2022b.
"""
self.model=model_fn
self.noise_schedule=noise_schedule
self.predict_x0=predict_x0
self.thresholding=thresholding
self.max_val=max_val
defnoise_prediction_fn(self,x,t):
"""
Return the noise prediction model.
"""
returnself.model(x,t)
defdata_prediction_fn(self,x,t):
"""
Return the data prediction model (with thresholding).
The adaptive step size solver based on singlestep DPM-Solver.
Args:
x: A pytorch tensor. The initial value at time `t_T`.
order: A `int`. The (higher) order of the solver. We only support order == 2 or 3.
t_T: A `float`. The starting time of the sampling (default is T).
t_0: A `float`. The ending time of the sampling (default is epsilon).
h_init: A `float`. The initial step size (for logSNR).
atol: A `float`. The absolute tolerance of the solver. For image data, the default setting is 0.0078, followed [1].
rtol: A `float`. The relative tolerance of the solver. The default setting is 0.05.
theta: A `float`. The safety hyperparameter for adapting the step size. The default setting is 0.9, followed [1].
t_err: A `float`. The tolerance for the time. We solve the diffusion ODE until the absolute error between the
current time and `t_0` is less than `t_err`. The default setting is 1e-5.
solver_type: either 'dpm_solver' or 'taylor'. The type for the high-order solvers.
The type slightly impacts the performance. We recommend to use 'dpm_solver' type.
Returns:
x_0: A pytorch tensor. The approximated solution at time `t_0`.
[1] A. Jolicoeur-Martineau, K. Li, R. Piché-Taillefer, T. Kachman, and I. Mitliagkas, "Gotta go fast when generating data with score-based models," arXiv preprint arXiv:2105.14080, 2021.
fromldm.modules.x_transformerimportEncoder,TransformerWrapper# TODO: can we directly rely on lucidrains code and simply add this as a reuirement? --> test