Autoencoders in molecular simulations
Introduction
An autoencoder is a neural network architecture for unsupervised learning that maps its input onto itself. Say we have a set of unlabeled training examples {x^(1), x^(2), ...} with each x^(i) in R^n.
The network is trained to approximate the identity function, h(x) ≈ x. For example, let the input x be the pixel intensity values of a 28x28 image (784 pixels), so n = 784, and let there be 300 units in the hidden layer. The output layer then also has 784 units. Since there are only 300 units in the middle, the network must learn the correlations among pixels (the structure) and reconstruct the 784-pixel input x from that structure.
The space of compressed representations is called the latent space. The network has two parts: an encoder that compresses the input and a decoder that expands the code back to its original size and content. Encoding is akin to PCA: it reduces the dimensionality of the input. Mathematically, if the encoder is a linear transformation given by a matrix U, we would have z = Ux for the latent code z.
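As a minimal sketch (not part of the original text), an undercomplete autoencoder with the 784–300–784 shape described above can be written in Keras roughly as follows; X_train refers to the MNIST training images prepared in the code example further below.

import tensorflow as tf

# A minimal undercomplete autoencoder: 784 -> 300 -> 784.
# The 300-unit bottleneck forces the network to learn structure shared across pixels.
ae_inputs = tf.keras.layers.Input(shape=[28, 28])
h = tf.keras.layers.Flatten()(ae_inputs)
h = tf.keras.layers.Dense(300, activation="relu")(h)       # latent code z
h = tf.keras.layers.Dense(28 * 28, activation="sigmoid")(h)
ae_outputs = tf.keras.layers.Reshape([28, 28])(h)
autoencoder = tf.keras.Model(inputs=[ae_inputs], outputs=[ae_outputs])
autoencoder.compile(loss="mse", optimizer="nadam")
# autoencoder.fit(X_train, X_train, epochs=10)  # the target is the input itself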
Denoising autoencoders add noise to the input data during training and learn to reconstruct the original, clean data from the noisy inputs. This prevents the trivial solution of simply copying the input over (which can happen with an overcomplete autoencoder, where the hidden layer is larger than the input), and it works well for tasks such as removing noise from images. Variational autoencoders (VAEs) are probabilistic models that can generate new data points from the learned latent space. Adversarial autoencoders (AAEs) use adversarial training to learn the distribution of the input data and of the latent space.
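For instance (a small sketch under the same toy setup, not from the original text), a denoising autoencoder only needs noise injected at the input during training; the clean image remains the target:

# Denoising autoencoder: corrupt the input with Gaussian noise, reconstruct the clean image.
dae_inputs = tf.keras.layers.Input(shape=[28, 28])
h = tf.keras.layers.GaussianNoise(0.2)(dae_inputs)   # noise is applied at training time only
h = tf.keras.layers.Flatten()(h)
h = tf.keras.layers.Dense(300, activation="relu")(h)
h = tf.keras.layers.Dense(28 * 28, activation="sigmoid")(h)
dae_outputs = tf.keras.layers.Reshape([28, 28])(h)
denoising_ae = tf.keras.Model(inputs=[dae_inputs], outputs=[dae_outputs])
denoising_ae.compile(loss="mse", optimizer="nadam")
# denoising_ae.fit(X_train, X_train, epochs=10)  # the target is the *clean* input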
Variational autoencoder
A variational autoencoder follows the usual autoencoder architecture, with an encoder and a decoder. The layer in the middle, however, is probabilistic: instead of producing a single vector representation of the input, it produces a mean μ and a (log) variance, and the latent code is sampled from the corresponding Gaussian, z = μ + σ ⊙ ε with ε ~ N(0, I) (the reparameterization trick, implemented by the Sampling layer in the code below).
The cost function has two parts: the usual reconstruction term (MSE, mean squared error) and the Kullback-Leibler (KL) divergence between the distribution of the latent codes and a standard Gaussian.
The first part ensures that the output resembles the input:

L_recon = ||x − x̂||², averaged over pixels and training examples.

The second part minimizes the distance between the latent distribution and a standard Gaussian. For a diagonal Gaussian encoder with mean μ and log-variance γ = log σ², it is

L_latent = −(1/2) Σ_j (1 + γ_j − exp(γ_j) − μ_j²),

which is the KL divergence KL(N(μ, σ²) ‖ N(0, I)); this is the term computed as latent_loss in the code below.
The resulting decoder can act as a generator: take a random vector z from the Gaussian prior and decode it into an image. We can also do image interpolation: take two points in the latent space and use the decoder to produce images along the path from one point to the other.
Code example
import tensorflow as tf
import matplotlib.pyplot as plt
import numpy as np
mnist = tf.keras.datasets.mnist.load_data()
(X_train_full, y_train_full), (X_test, y_test) = mnist
X_train_full = X_train_full.astype(np.float32) / 255
X_test = X_test.astype(np.float32) / 255
X_train, X_valid = X_train_full[:-5000], X_train_full[-5000:]
y_train, y_valid = y_train_full[:-5000], y_train_full[-5000:]
class Sampling(tf.keras.layers.Layer):
    def call(self, inputs):
        mean, log_var = inputs
        # Reparameterization trick: z = mean + sigma * epsilon, with epsilon ~ N(0, I)
        return tf.random.normal(tf.shape(log_var)) * tf.exp(log_var / 2) + mean
tf.random.set_seed(42) # extra code – ensures reproducibility on CPU
codings_size = 10
inputs = tf.keras.layers.Input(shape=[28, 28])
Z = tf.keras.layers.Flatten()(inputs)
Z = tf.keras.layers.Dense(150, activation="relu")(Z)
Z = tf.keras.layers.Dense(100, activation="relu")(Z)
codings_mean = tf.keras.layers.Dense(codings_size)(Z) # μ
codings_log_var = tf.keras.layers.Dense(codings_size)(Z) # γ
codings = Sampling()([codings_mean, codings_log_var])
variational_encoder = tf.keras.Model(
inputs=[inputs], outputs=[codings_mean, codings_log_var, codings])
decoder_inputs = tf.keras.layers.Input(shape=[codings_size])
x = tf.keras.layers.Dense(100, activation="relu")(decoder_inputs)
x = tf.keras.layers.Dense(150, activation="relu")(x)
x = tf.keras.layers.Dense(28 * 28)(x)
outputs = tf.keras.layers.Reshape([28, 28])(x)
variational_decoder = tf.keras.Model(inputs=[decoder_inputs], outputs=[outputs])
_, _, codings = variational_encoder(inputs)
reconstructions = variational_decoder(codings)
variational_ae = tf.keras.Model(inputs=[inputs], outputs=[reconstructions])
# KL divergence between N(mean, sigma^2) and the standard Gaussian, summed over coding dimensions
latent_loss = -0.5 * tf.reduce_sum(
    1 + codings_log_var - tf.exp(codings_log_var) - tf.square(codings_mean),
    axis=-1)
variational_ae.add_loss(tf.reduce_mean(latent_loss) / 784.)  # scaled to match the per-pixel MSE
variational_ae.compile(loss="mse", optimizer="nadam")
history = variational_ae.fit(X_train, X_train, epochs=25, batch_size=128,
validation_data=(X_valid, X_valid))
def plot_reconstructions(model, images=X_valid, n_images=5):
reconstructions = np.clip(model.predict(images[:n_images]), 0, 1)
fig = plt.figure(figsize=(n_images * 1.5, 3))
for image_index in range(n_images):
plt.subplot(2, n_images, 1 + image_index)
plt.imshow(images[image_index], cmap="binary")
plt.axis("off")
plt.subplot(2, n_images, 1 + n_images + image_index)
plt.imshow(reconstructions[image_index], cmap="binary")
plt.axis("off")
plot_reconstructions(variational_ae)
plt.savefig('reconstructed2')
plt.show()
# Sweep one latent dimension from -0.8 to 0.8 while keeping the others at 0
codings = np.zeros([7, codings_size])
codings[:, 3] = np.linspace(-0.8, 0.8, 7)
images = variational_decoder(codings).numpy()
def plot_multiple_images(images, n_cols=None):
n_cols = n_cols or len(images)
n_rows = (len(images) - 1) // n_cols + 1
if images.shape[-1] == 1:
images = images.squeeze(axis=-1)
plt.figure(figsize=(n_cols, n_rows))
for index, image in enumerate(images):
plt.subplot(n_rows, n_cols, index + 1)
plt.imshow(image, cmap="binary")
plt.axis("off")
plot_multiple_images(images)
plt.savefig('interpolation2')
plt.show()
We can use the decoder to move between points in the latent space, so that in input space we see a smooth interpolation between objects.
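As mentioned earlier, the trained decoder can also act as a generator on its own: decoding random codings sampled from the Gaussian prior produces entirely new digits. A small extra snippet (not part of the original listing):

# Sample random codings from the Gaussian prior and decode them into new digits
codings = tf.random.normal(shape=[12, codings_size])
images = variational_decoder(codings).numpy()
plot_multiple_images(images, n_cols=6)
plt.show()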
VAE in molecular simulation
In physics, coarse-graining (CG) is a technique that simplifies the particle representation by grouping selected atoms into pseudo-beads. This greatly reduces the cost of simulation and thus facilitates the study of chemical space (for example, protein folding). Some information is inevitably lost in the process. If we want to study interactions at the atomic level, we need the opposite process, called backmapping: restoring fine-grained (FG) coordinates from CG coordinates, i.e. adding the lost atomistic detail back into the CG representation. Several difficulties arise. First, backmapping is stochastic: many FG configurations map to the same CG representation, so the reverse, generative map is one-to-many. Second, there is a geometric consistency requirement: the generated FG coordinates should respect the CG geometry, in the sense that they can be reduced back to the same CG structure; and since the CG transformation is equivariant under rotations and reflections, the FG reconstruction should be equivariant as well. Third, there are different mapping protocols: CG schemes differ in how they reduce dimensionality, trading accuracy against speed, so a backmapping approach should ideally be robust across CG mapping choices.
Image: CG conversion and back (FG)
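To make the CG map concrete, here is a small, hypothetical sketch (my own illustration, not taken from the papers discussed below) in which each pseudo-bead is the mean position of a group of atoms; the geometric-consistency requirement then says that a backmapped structure x̂ should satisfy coarse_grain(x̂) ≈ X:

import numpy as np

def coarse_grain(x, groups):
    # x: FG coordinates with shape (n_atoms, 3)
    # groups: hypothetical list of atom-index lists, one per pseudo-bead
    # Each bead is placed at the mean position (center of geometry) of its atoms.
    return np.stack([x[idx].mean(axis=0) for idx in groups])

# Toy example: 6 atoms grouped into 2 pseudo-beads
x = np.random.randn(6, 3)
groups = [[0, 1, 2], [3, 4, 5]]
X = coarse_grain(x, groups)

# Consistency check for a proposed backmapped structure x_hat
x_hat = x + 0.01 * np.random.randn(6, 3)
print(np.allclose(coarse_grain(x_hat, groups), X, atol=0.1))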
Recently, researchers have made significant progress on this task by using machine learning, specifically the CGVAE (Coarse-Graining Variational Auto-Encoder) model, as a generative approach. It can approximate atomistic coordinates from coarse-grained structures quite well. The mathematical problem is to model the conditional distribution p(x | X): the probability of the fine-grained coordinates x given the coarse-grained coordinates X.
Let's introduce some notation. The FG system of n atoms is written as its atomistic coordinates x ∈ R^(n×3); the CG system of N < n pseudo-beads is X ∈ R^(N×3), obtained from x by a coarse-graining map.
The geometric requirements of backmapping are, first, equivariance: since the CG transformation is equivariant under rotations and reflections, the generated FG structure should transform in the same way as the CG structure; and second, consistency: the backmapped x should map back to X itself. These requirements are satisfied by the generative CGVAE framework. Since the conditional distribution p(x | X) is one-to-many (many FG structures are consistent with a single CG structure), a probabilistic, VAE-style model is a natural fit.
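Schematically (writing it in generic conditional-VAE form rather than the paper's exact objective), the model is trained by maximizing an evidence lower bound on log p(x | X):

log p(x | X) ≥ E_{q(z | x, X)} [ log p(x | z, X) ] − KL( q(z | x, X) ‖ p(z) ),

where q is the encoder, p(x | z, X) is the decoder, and p(z) is a standard Gaussian prior.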
Further work has adapted the VAE framework described above into an impressive backmapping tool for CG protein representations. It achieves this by basing the representation on internal coordinates, enforcing an equivariant encoder, and using a custom loss function that accounts for local structure, global structure, and physical constraints. The model is also trained on a high-quality, curated protein dataset.
The loss function is very similar to that of a VAE: a reconstruction term plus a KL term on the latent space, with the reconstruction measured on internal coordinates. For the local structure, the reconstructed bond lengths are compared with the ground truth, where B is the set of all bonds, b is the ground-truth value and b̂ is the corresponding reconstruction.
They also add a root mean squared deviation (RMSD) loss in Cartesian coordinate space, comparing the reconstructed atomic positions with the ground truth:

RMSD(x, x̂) = sqrt( (1/n) Σ_i ||x_i − x̂_i||² ).
They also put constraints on the chemical validity of the structure: a steric loss makes sure that the distance between any two non-bonded atoms is larger than 2.0 Å.
In total, the training loss is a weighted sum of these terms: the internal-coordinate reconstruction loss, the KL divergence, the RMSD loss, and the steric loss.
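As a rough illustration only (my own sketch, not the authors' code), these terms could be combined in TensorFlow along the following lines; the hinge form of the steric term and the weights w_* are assumptions:

import tensorflow as tf

def backmapping_loss(x_true, x_pred, bonds_true, bonds_pred,
                     mean, log_var, nonbonded_dists,
                     w_bond=1.0, w_kl=1e-3, w_rmsd=1.0, w_steric=1.0):
    # Local structure: mean squared error on internal coordinates (here, bond lengths)
    bond_loss = tf.reduce_mean(tf.square(bonds_true - bonds_pred))
    # Latent regularization: KL divergence against a standard Gaussian (same form as the VAE above)
    kl_loss = -0.5 * tf.reduce_mean(tf.reduce_sum(
        1 + log_var - tf.exp(log_var) - tf.square(mean), axis=-1))
    # Global structure: RMSD between reconstructed and ground-truth Cartesian coordinates
    rmsd_loss = tf.sqrt(tf.reduce_mean(tf.reduce_sum(tf.square(x_true - x_pred), axis=-1)))
    # Physical constraint: hinge penalty (assumed form) on non-bonded pairs closer than 2.0 Angstrom
    steric_loss = tf.reduce_mean(tf.nn.relu(2.0 - nonbonded_dists))
    return (w_bond * bond_loss + w_kl * kl_loss
            + w_rmsd * rmsd_loss + w_steric * steric_loss)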