I am training an auto-encoder network that encodes multivariate time sequences over a large data set. I uploaded a fully working example as a gist.
Even though it works, I have to choose between an extremely memory-inefficient solution and a slow one, and I would like to make my memory-efficient solution faster.
In my memory-inefficient setup I prepare the training set as in plain_dataset, i.e. by materializing sliding windows over the whole data set (in my real training setup those windows are highly overlapping).
Instead, I want to define my training set like in idx_dataset, as a list of (dataset_index, row_index, size) tuples. Before each training step these references are resolved and the actual training examples are compiled.
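For illustration, a minimal sketch of how such an index list can be built (simplified, with placeholder series lengths; the gist contains the full version, with WINDOW_LEN = 14):

import tensorflow as tf

WINDOW_LEN = 14  # constant window size, as in the gist

def make_index_tuples(series_lengths, window_len=WINDOW_LEN, stride=1):
    # one (dataset_index, row_index, size) triple per sliding-window position
    idx = []
    for series_idx, length in enumerate(series_lengths):
        for row in range(0, length - window_len + 1, stride):
            idx.append((series_idx, row, window_len))
    return tf.constant(idx, dtype=tf.int32)

# e.g. three series of different lengths; only the cheap index triples are shuffled and batched
idx_dataset = tf.data.Dataset.from_tensor_slices(make_index_tuples([1000, 500, 2000]))
idx_dataset = idx_dataset.shuffle(10_000).batch(32)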
I converted the time series data set to a RaggedTensor rt to be able to access the data in graph mode. Then I implemented the index resolution in resolve_index_batch, which is called as part of my train_step.
My hope was that composing the training examples this way is quite cheap compared to the actual training step, but the throughput is almost halved when using the index training set.
Any ideas how to make the
resolve_index_batch function more efficient?
@tf.function
def resolve_index_batch(idx_batch):
    """
    :param idx_batch: (32x3) int32 tensor. Each row contains [time_series_idx, row_idx, window_size]
    """
    return tf.map_fn(
        fn=lambda idx: tf.slice(rt[idx[0]], [idx[1], 0], [idx[2], -1]),
        elems=idx_batch,
        fn_output_signature=tf.float32)

@tf.function
def train_step(batch):
    if mode == 'idx':
        # apply the conversion from an index batch to a data series batch
        batch = resolve_index_batch(batch)

    # train on the reversed time series
    batch_flip = tf.reverse(batch, axis=[1])
    with tf.GradientTape() as tape:
        m = model(batch, training=True)
        loss = loss_fun(batch_flip, m)
    grad = tape.gradient(loss, model.trainable_weights)
    opt.apply_gradients(zip(grad, model.trainable_weights))
    return loss
To speed up resolve_index_batch, use tf.gather to replace tf.map_fn. tf.map_fn is a trap that many people fall into, and it performs very poorly on GPU. tf.gather is the correct way to vectorize your operations.
@tf.function
def resolve_index_batch_fast(idx_batch):
    """
    Note that the window length has to be a constant, which is the case in your gist code (WINDOW_LEN = 14).
    """
    batch_size = tf.shape(idx_batch)[0]  # if the batch size is constant, this line can be optimized as well
    row_indices = tf.tile(tf.range(WINDOW_LEN)[None, :], [batch_size, 1]) + idx_batch[:, 1:2]
    return tf.gather(tf.gather(rt, idx_batch[:, 0]), row_indices, batch_dims=1)
In my experiments it is at least several times faster than resolve_index_batch on GPU, depending on the batch size: the larger the batch size, the larger the speedup. On CPU, tf.map_fn works fairly well though.
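You can confirm this on your own hardware with a rough micro-benchmark; this is only a sketch and assumes rt, resolve_index_batch and resolve_index_batch_fast from above are defined and that the first series has enough rows:

import time

idx_batch = tf.constant([[0, i, 14] for i in range(32)], dtype=tf.int32)

for name, fn in [('map_fn', resolve_index_batch), ('gather', resolve_index_batch_fast)]:
    fn(idx_batch)  # warm-up: trace the tf.function once
    start = time.perf_counter()
    for _ in range(100):
        # reduce to a scalar and pull it to the host to force the ops to finish
        _ = tf.reduce_sum(fn(idx_batch)).numpy()
    print(name, time.perf_counter() - start)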
To compare the performance of the plain approach vs. the index approach, it is not very meaningful to use data this small, where the entire rt fits into GPU memory. At a practical and realistic problem size for deep learning, the entire rt would be much larger and could only reside in CPU memory, or even on SSD.
So the first thing to do is to ensure rt is placed in CPU memory:
def series_ragged_tensor():
    with tf.device('/CPU:0'):
        rt = tf.RaggedTensor.from_row_lengths(
            tf.concat(data_series(), axis=0),
            [ser.shape[0] for ser in data_series()])
    return rt
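You can verify the placement by checking the device of the underlying flat values (a quick eager-mode check, assuming data_series() from the gist is available):

rt = series_ragged_tensor()
print(rt.flat_values.device)  # should report a CPU device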
Second, do the data preparation on the CPU asynchronously: resolve the index batches inside the tf.data input pipeline (map followed by prefetch) instead of inside train_step. Move the resolve_index_batch definition and rt = series_ragged_tensor() to the right place accordingly, and make train_step the same for index mode and plain mode. A sketch of what that could look like follows below.
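This is only a sketch; it assumes idx_dataset already yields 32x3 index batches as in your train_step, so only the map and prefetch stages are added:

rt = series_ragged_tensor()  # created on the CPU, see above

train_ds = (idx_dataset  # yields 32x3 index batches as before
            .map(resolve_index_batch_fast, num_parallel_calls=tf.data.AUTOTUNE)
            .prefetch(tf.data.AUTOTUNE))

# the map/prefetch stages run on the CPU in the background, so train_step now
# receives ready-made window batches in both modes and no longer needs the
# mode == 'idx' branch
for batch in train_ds:
    loss = train_step(batch)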