Practical Aspects of Maximum Likelihood

Target Function vs. Optimization Method

Distinction is sometimes lost, but the two can generally be chosen independently of each other

Target Function = mathematical score for judging the fit of the model to the data

LSQ (RESI): old-fashioned least-squares target

MLF1: maximum likelihood target based on amplitudes, assuming a Gaussian error in measurement of F

MLF2: maximum likelihood target based on intensities, assuming a Gaussian error in measurement of F²

MLHL: maximum likelihood target based on amplitudes plus prior phase information encoded by Hendrickson-Lattman coefficients

Optimization method = mathematical algorithm for improving the target function by changing the model

Monte Carlo: random search using only function values

Genetic algorithms: random search using only function values

Steepest descents: search for minimum follows vector of first derivatives of target function with respect to model parameters

Simulated annealing: first derivatives of target function contribute to forces in molecular dynamics run

Conjugate gradients: second derivative information is inferred from the history of the refinement run

Gradient/curvature: uses first and unmixed second derivatives

Full-matrix: uses first derivatives and full-matrix of mixed second derivatives
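
The independence of target and optimizer can be illustrated with a toy problem. The sketch below is only an illustration under stated assumptions: `target` is a made-up two-parameter quadratic (not a crystallographic target), and `random_search` and `steepest_descent` are hypothetical minimal versions of a function-value-only search and a first-derivative search. The same target is handed to both optimizers unchanged.

```python
import numpy as np

def target(x):
    """Toy target function (assumption): quadratic bowl with minimum at (1, -2)."""
    return (x[0] - 1.0) ** 2 + 4.0 * (x[1] + 2.0) ** 2

def target_gradient(x):
    """First derivatives of the toy target with respect to the two parameters."""
    return np.array([2.0 * (x[0] - 1.0), 8.0 * (x[1] + 2.0)])

def random_search(f, x0, steps=5000, scale=0.2, seed=0):
    """Random search using only function values: keep a random trial if it scores better."""
    rng = np.random.default_rng(seed)
    x = np.array(x0, dtype=float)
    fx = f(x)
    for _ in range(steps):
        trial = x + rng.normal(0.0, scale, size=x.shape)
        f_trial = f(trial)
        if f_trial < fx:
            x, fx = trial, f_trial
    return x

def steepest_descent(f_grad, x0, steps=500, rate=0.1):
    """Steepest descent: step along the negative vector of first derivatives."""
    x = np.array(x0, dtype=float)
    for _ in range(steps):
        x = x - rate * f_grad(x)
    return x

# Same target, two different optimization methods.
x_random = random_search(target, [0.0, 0.0])
x_descent = steepest_descent(target_gradient, [0.0, 0.0])
```

Swapping in a different target would require no change to either optimizer, which is the sense in which the two choices are independent.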

Probability distribution of amplitude F_O

Integrate out unknown phase α from the complex structure-factor distribution p(F, α; F_C, D, σ_Δ) to get the amplitude distribution p(F; F_C, D, σ_Δ)

Add observational error to amplitude (or intensity)

p(F_O; F_C, D, σ_Δ, σ_F) = p(F; F_C, D, σ_Δ) ⊗ p(F_O − F)   (⊗ = convolution over the true amplitude F)
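
The two steps above can be checked numerically. This is a minimal sketch, assuming the acentric case with no symmetry-enhancement factor and a complex Gaussian of total variance σ_Δ² centred on D·F_C; under those assumptions the phase integral gives the Rice distribution. The function names and the numerical values are illustrative, not taken from any refinement program.

```python
import numpy as np

def complex_density_polar(F, alpha, Fc, D, sig_delta):
    """Complex Gaussian centred on D*Fc (placed on the real axis), in polar
    coordinates; the leading F is the Jacobian of the change of variables."""
    dist_sq = F**2 + (D * Fc) ** 2 - 2.0 * F * D * Fc * np.cos(alpha)
    return F / (np.pi * sig_delta**2) * np.exp(-dist_sq / sig_delta**2)

def amplitude_density(F, Fc, D, sig_delta):
    """Closed form after integrating out the phase (Rice distribution, acentric)."""
    x = 2.0 * F * D * Fc / sig_delta**2
    return (2.0 * F / sig_delta**2
            * np.exp(-(F**2 + (D * Fc) ** 2) / sig_delta**2) * np.i0(x))

def observed_density(Fo, Fc, D, sig_delta, sig_F, n=4001):
    """Numerically convolve the amplitude density with a Gaussian error in F."""
    F = np.linspace(0.0, D * Fc + 8.0 * (sig_delta + sig_F), n)
    gauss = np.exp(-0.5 * ((Fo - F) / sig_F) ** 2) / (sig_F * np.sqrt(2.0 * np.pi))
    return np.sum(amplitude_density(F, Fc, D, sig_delta) * gauss) * (F[1] - F[0])

# Integrate the polar density over a uniform grid of phases and compare
# with the closed form at one amplitude (illustrative values).
alpha = np.linspace(0.0, 2.0 * np.pi, 720, endpoint=False)
numeric = complex_density_polar(90.0, alpha, 100.0, 0.8, 30.0).mean() * 2.0 * np.pi
closed = amplitude_density(90.0, 100.0, 0.8, 30.0)
```

`np.i0` overflows for very large arguments, so a production implementation would work with logarithms or a scaled Bessel function; the direct form is kept here for readability.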

There is a good Gaussian approximation for the case of a Gaussian error in amplitudes (basis of MLF1 target):

σ_ML incorporates contributions of both σ_Δ and σ_F

Calibrating the Likelihood Functions

The likelihood functions include the parameters σ_Δ and D, each of which is a resolution-dependent parameter that can be calculated from σ_A

σ_A: fraction of normalized structure factor that is correct

combined effect of errors and model incompleteness

adjust σ_A to maximize likelihood function in SIGMAA
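
How D and σ_Δ follow from σ_A can be sketched with the standard relations σ_A = D·sqrt(Σ_P/Σ_N) and σ_Δ² = Σ_N − D²·Σ_P = Σ_N·(1 − σ_A²). Here Σ_N and Σ_P (the mean intensities of the true and model structure factors in a resolution shell) are symbols introduced for this example and do not appear in the text above; symmetry-enhancement factors are ignored and the function name is hypothetical.

```python
import math

def d_and_sigma_delta_sq(sigma_a, sigma_p, sigma_n):
    """Convert sigma_A for one resolution shell into D and sigma_Delta squared.

    sigma_p: mean calculated (model) intensity in the shell (assumed name)
    sigma_n: mean true intensity in the shell (assumed name)
    """
    d = sigma_a * math.sqrt(sigma_n / sigma_p)
    sigma_delta_sq = sigma_n * (1.0 - sigma_a**2)
    return d, sigma_delta_sq

# Illustrative values only.
d, sigma_delta_sq = d_and_sigma_delta_sq(sigma_a=0.8, sigma_p=900.0, sigma_n=1000.0)
```

A σ_A of 1 leaves no residual variance, and a σ_A of 0 gives D = 0, so the model carries no information about the true structure factor.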

The relative weight between the X-ray terms (data likelihood) and the geometry terms (restraints) is also important

in theory, this falls out directly from respective probabilities

in practice, the diffraction data are overfit, and convergence requires overweighting the data likelihood component

the amount of overfitting depends on the resolution

so does the necessary overweighting

heuristic relative weighting is achieved either by comparing the X-ray and geometry gradients, or by trial and error

divide X-PLOR suggested weight by about two

start from about 50 for TNT
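
The gradient-comparison heuristic can be sketched as follows. This is an assumption-laden illustration, not the procedure of X-PLOR or TNT: it takes the weight on the X-ray term to be the ratio of the geometry and X-ray gradient norms, and the default `scale=0.5` merely echoes the "divide by about two" advice above. The function name and arguments are hypothetical.

```python
import numpy as np

def gradient_ratio_weight(grad_xray, grad_geom, scale=0.5):
    """Weight w for a total target of the form E_geometry + w * E_xray.

    grad_xray, grad_geom: gradients of the two terms with respect to the
    model parameters, as flat arrays of equal length.
    """
    norm_xray = np.linalg.norm(grad_xray)
    norm_geom = np.linalg.norm(grad_geom)
    if norm_xray == 0.0:
        raise ValueError("X-ray gradient is zero; the ratio is undefined")
    return scale * norm_geom / norm_xray

# Illustrative gradients only.
weight = gradient_ratio_weight(np.array([3.0, 4.0]), np.array([6.0, 8.0]))
```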

Problems in estimating σ_A

Could run original SIGMAA, then estimate σ_Δ and D using σ_A

but σ_A is overestimated with data used in refinement

Could use cross-validation (R_free) data to estimate σ_A

but original SIGMAA requires 500-1000 reflections per resolution shell to get reliable estimates

Cure: add a smoothing term to exploit the fact that σ_A values should vary smoothly with resolution.

this is sufficient to make the estimation behave with about 500-1000 reflections in total.
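
One way such a smoothing term could look is sketched below. The form of the penalty is an assumption: the text does not say how the smoothing is implemented, so a squared second difference across neighbouring resolution bins is used here as a generic roughness measure. The per-bin likelihood terms are passed in as callables and are hypothetical placeholders.

```python
import numpy as np

def roughness_penalty(sigma_a_bins, weight=1.0):
    """Penalize per-bin values that do not vary smoothly with resolution.

    Uses squared second differences, which vanish for a linear trend.
    """
    second_diff = np.diff(np.asarray(sigma_a_bins, dtype=float), n=2)
    return weight * np.sum(second_diff**2)

def smoothed_objective(sigma_a_bins, neg_log_likelihood_per_bin, weight=1.0):
    """Sum of per-bin negative log-likelihoods plus the smoothing term."""
    data_term = sum(nll(s) for nll, s in zip(neg_log_likelihood_per_bin, sigma_a_bins))
    return data_term + roughness_penalty(sigma_a_bins, weight)
```

With the penalty in place, a bin with few reflections is constrained by its neighbours rather than by its own noisy likelihood alone.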

Initially Estimating and Updating σ_A Values

The likelihood target depends on the validity of the σ_A values

must not be contaminated by refinement bias

if refinement has been carried out before setting aside cross-validation data, the overfitting bias must be removed

e.g. carry out a run of room-temperature dynamics

σ_A values are estimated in resolution bins

there should be enough reflections in each bin to obtain numerical stability

there should be enough bins to capture the variation of σ_A with resolution

good compromise -> set the number of bins to:

total number of reflections divided by 1000, or
number of cross-validation reflections divided by 50,
whichever is less
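
The bin-count rule above is simple arithmetic; a minimal sketch follows. The function name is hypothetical, integer division is used for the two quotients, and the floor of one bin for very small data sets is an added assumption.

```python
def number_of_bins(n_total, n_free):
    """Number of resolution bins: the lesser of total/1000 and free/50, at least 1."""
    return max(1, min(n_total // 1000, n_free // 50))

# Illustrative data set: 20000 reflections, 500 of them in the cross-validation set.
bins = number_of_bins(20000, 500)
```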

As refinement progresses, the model improves (we hope), so σ_A values should increase

The change in σ_A values changes the refinement target in a basic way, changing even the optimal value of F_c

unlike the case of the least-squares target

As a result, σ_A values should be updated once or twice through a refinement run

Effect of Non-Crystallographic Symmetry

In the presence of NCS, randomly chosen cross-validation reflections are not independent of the working data

statistical relationships between reflections related by the rotational part of the NCS operation

depends on number of NCS-related molecules and on relationship between NCS and crystallographic symmetry

Overfitting of the working data will spread into the cross-validation set

not generally very serious for, say, 2-fold NCS

20-fold NCS hindered ML refinement of one structure we determined

Possible solution: select cross-validation data in thin resolution shells
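
Selecting cross-validation data in thin shells could be sketched as below. Reflections related by the rotational part of an NCS operation lie at the same resolution, so assigning whole shells to the cross-validation set keeps such relatives together. This is an illustration under assumptions: equal-count shells, a 5% free fraction, and the function and parameter names are all hypothetical rather than taken from any program.

```python
import numpy as np

def select_free_in_thin_shells(d_spacing, n_shells=100, free_fraction=0.05, seed=0):
    """Return a boolean mask marking cross-validation reflections.

    Reflections are sorted by 1/d^2, cut into thin equal-count shells, and
    whole shells are chosen at random until the requested fraction is reached.
    """
    rng = np.random.default_rng(seed)
    s_sq = 1.0 / np.asarray(d_spacing, dtype=float) ** 2
    order = np.argsort(s_sq)
    shells = np.array_split(order, n_shells)
    n_free_shells = max(1, round(free_fraction * n_shells))
    chosen = rng.choice(n_shells, size=n_free_shells, replace=False)
    free = np.zeros(s_sq.size, dtype=bool)
    for k in chosen:
        free[shells[k]] = True
    return free

# Illustrative data: 10000 reflections between 20 and 1.5 Angstrom.
d_values = np.linspace(1.5, 20.0, 10000)
free_mask = select_free_in_thin_shells(d_values)
```

The shells need to be thin so that each bin used for σ_A estimation still contains cross-validation reflections.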