Memory-efficient sliding-window sequence learning with TensorFlow

I am training an auto-encoder network that encodes multivariate time sequences over a large data set. I uploaded a fully working example as a gist.

Even though it works, I currently have to choose between an extremely memory-inefficient solution and a slow one, and I would like to make the memory-efficient solution faster.

In my memory-inefficient setup I prepare the training set as in plain_dataset, i.e. by materializing sliding windows over the whole data set (in my real training setup these windows overlap heavily).

Instead, I want to define my training set as in idx_dataset: a list of (dataset_index, row_index, size) tuples. Before each training step these references are resolved and the actual training examples are assembled.
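For illustration, such an index array could be built along these lines (a rough sketch; the build_index_array helper and the stride parameter are only illustrative and not necessarily how the gist builds the indices):

import numpy as np

def build_index_array(series_list, window_len, stride=1):
    """Illustrative helper: enumerate (dataset_index, row_index, window_size) tuples."""
    idx = []
    for series_idx, series in enumerate(series_list):
        for row_idx in range(0, series.shape[0] - window_len + 1, stride):
            idx.append((series_idx, row_idx, window_len))
    return np.asarray(idx, dtype=np.int32)

# e.g. arr = build_index_array(data_series(), WINDOW_LEN)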

I converted the time series data set to a RaggedTensor rt so that I can access the data in graph mode. I then implemented the index resolution in resolve_index_batch as part of my custom_fit function.

My hope was that assembling a training example this way would be cheap compared to the actual training step, but the throughput almost halves when using the index-based training set.

Any ideas how to make the resolve_index_batch function more efficient?

@tf.function
def resolve_index_batch(idx_batch):
    """
    param idx_batch: (32x3) int32 tensor. Each row containing [time_series_idx, row_idx, window_size] 
    """
    return tf.map_fn(fn=lambda idx: tf.slice(rt[idx[0]], [idx[1], 0], [idx[2], -1]), elems=idx_batch, fn_output_signature=tf.float32)

@tf.function
def train_step(batch):
    if mode == 'idx':
        # apply the conversion from indexes batch to data_series batch
        batch = resolve_index_batch(batch)

    # train on reversed time series
    batch_flip = tf.reverse(batch, [1])
    with tf.GradientTape() as tape:
        m = model(batch, training=True)
        loss = loss_fun(batch_flip, m)
        grad = tape.gradient(loss, model.trainable_weights)

    opt.apply_gradients(zip(grad, model.trainable_weights))
    return loss
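
For context, custom_fit is essentially a loop over the dataset that calls train_step; a simplified sketch (not the exact code from my gist):

def custom_fit(dataset, steps_per_epoch, epochs):
    # simplified sketch of the custom training loop that drives train_step
    for epoch in range(epochs):
        for batch in dataset.take(steps_per_epoch):
            loss = train_step(batch)
        print(f"epoch {epoch}: loss={float(loss):.4f}")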

Answer

To speed up resolve_index_batch, replace tf.map_fn with tf.gather. tf.map_fn is a trap that many people fall into: it performs very poorly on GPU. tf.gather is the correct way to vectorize this operation.

Example code:

@tf.function
def resolve_index_batch_fast(idx_batch):
    """
    Note that the window length must be a constant, which is the case in your gist (WINDOW_LEN = 14).
    """
    batch_size = idx_batch.shape[0]  # if the batch size is constant, this line can be optimized as well
    series_batch = tf.gather(rt, idx_batch[:, 0])  # one full series per example (ragged)
    row_indices = tf.tile(tf.range(window_length)[None, :], [batch_size, 1]) + idx_batch[:, 1:2]
    return tf.gather(series_batch, row_indices, batch_dims=1)  # slice a window out of each series
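
As a quick sanity check, you can compare the two implementations on a small index batch in eager mode (the window size stored in the indices must match the window_length constant, and the to_tensor conversion is only needed if the gather returns a ragged result):

# sanity check: both implementations should produce identical windows
idx_batch = tf.constant([[0, 0, 14], [1, 5, 14]], dtype=tf.int32)  # any valid (series, row, size) triples
slow = resolve_index_batch(idx_batch)
fast = resolve_index_batch_fast(idx_batch)
if isinstance(fast, tf.RaggedTensor):  # convert in case gather returned a ragged result
    fast = fast.to_tensor()
tf.debugging.assert_near(slow, fast)   # raises InvalidArgumentError if they differ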

In my experiments it is at least several times faster than resolve_index_batch on GPU, depending on the batch size: the larger the batch size, the larger the speedup. On CPU, however, tf.map_fn works fairly well.

Comparing the performance of the plain approach and the index approach is not very meaningful with data this small, where the entire rt fits into GPU memory. At a practical, realistic problem size for deep learning, rt would be much larger and could only reside in CPU memory, or even on SSD.

So the first thing to do is to make sure rt lives in CPU memory:

def series_ragged_tensor():
    with tf.device('/CPU:0'):
        # build the ragged tensor on the CPU so it does not occupy GPU memory
        series = data_series()  # call once instead of twice
        rt = tf.RaggedTensor.from_row_lengths(
            tf.concat(series, axis=0),
            [ser.shape[0] for ser in series])
    return rt
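
You can verify the placement by inspecting the device of the ragged tensor's underlying values:

rt = series_ragged_tensor()
print(rt.flat_values.device)  # should report a CPU device, e.g. '.../device:CPU:0'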

Second, do the data preparation on the CPU asynchronously. Inside def idx_dataset():

return (tf.data.Dataset.from_tensor_slices(arr)
        .batch(32)
        .map(resolve_index_batch)
        .repeat()
        .prefetch(tf.data.AUTOTUNE))

Third, move the resolve_index_batch definition and rt = series_ragged_tensor() to the appropriate place, and make train_step identical for the index mode and the plain mode accordingly, since the index resolution now happens inside the input pipeline.
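
With that change the mode check disappears from the training step; a minimal sketch of the resulting train_step:

@tf.function
def train_step(batch):
    # the dataset now delivers resolved (batch, time, features) windows in both modes
    # train on reversed time series
    batch_flip = tf.reverse(batch, [1])
    with tf.GradientTape() as tape:
        m = model(batch, training=True)
        loss = loss_fun(batch_flip, m)
    grad = tape.gradient(loss, model.trainable_weights)
    opt.apply_gradients(zip(grad, model.trainable_weights))
    return loss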