Intro to LSTMs with Keras/TensorFlow

As I mentioned in my previous post, one of our big focuses recently has been on time series data for either predictive analysis or classification. The intent is to use this in concert with a lot of other tooling in our framework to solve some real-world applications.

One example is a pretty classic time series prediction problem. A customer manages large volumes of finances in a portfolio where the equivalent of purchase orders are made (in extremely high values), and planned cost often drifts from the actual outcomes. The deltas between the two are an area of concern for the customer as they look for ways to better manage their spending. We have a proof-of-concept dashboard tool which rolls up their hierarchical portfolio and does some basic threshold-based calculations for things like these deltas.

A much more complex example we are working on, related to our trajectories in belief space, is the ability to identify patterns of human cultural and social behaviors (HCSB) in computer-mediated communication in order to look for trustworthy information based on agent interaction. One small piece of this work is the ability to teach a machine to identify these agent patterns over time. We’ve tried various unsupervised learning approaches which, in combination with techniques such as dynamic time warping (DTW), have been successful at discriminating agents in simulation, but they have some major limitations.

For many time series problems, a very effective way of applying deep learning is to use Recurrent Neural Networks (RNNs), which allow the history of the series to help inform the output. This is particularly important in cases involving language, such as machine translation or autocompletion, where the context of a sentence may be formed by elements that appeared earlier in the text. Convolutional networks (CNNs), by contrast, are most effective when the tensor elements have a distinct positional meaning in relationship to each other. The most common example is a matrix of pixel values, where the value of a pixel has direct relevance to nearby pixels. This allows for some nice parallelization and other optimizations, because you can assume that a small window of pixels will be relevant to each other and not necessarily dependent on “meaning” from pixels somewhere else in the picture. This is obviously a very simplified explanation, and there are lots of ways CNNs are being expanded to have broader applications, including for language.

In any case, despite recent arguments that CNNs are relevant for all ML problems, the truth is that RNNs are particularly good at sequential problems which rely on the context of the entire series of data. This is of course useful for time series data as well as language problems.

The most common and popular RNN implementation for this is the Long Short-Term Memory (LSTM) network. I won’t dive into all of the details of how LSTMs work under the covers, but I think it’s best understood this way: in a traditional artificial neural network each neuron has a single activation function that passes a single value onward, whereas LSTMs have units (or cells in some literature) which are more complex, most commonly consisting of a memory cell, an input gate, an output gate and a forget gate. A given LSTM layer has a configured number of fully connected LSTM units, each of which contains the above pieces. This allows each unit to keep some “memory” of previous pieces of information, which helps the model factor in things such as language context or patterns in the data occurring over time. Here is a link for a more complete explanation:
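To make the gate structure a bit more concrete, here is a minimal NumPy sketch of a single time step for one LSTM layer, using the commonly taught formulation. The function name `lstm_step`, the weight names `W`, `U`, `b`, and the gate ordering are assumptions made for this illustration; they are not how Keras lays out its weights internally.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x_t, h_prev, c_prev, W, U, b):
    """One time step for a layer of LSTM units (illustrative sketch).

    Shapes: x_t (input_dim,), h_prev and c_prev (units,),
    W (input_dim, 4*units), U (units, 4*units), b (4*units,).
    Gate order i, f, g, o is an assumption of this sketch.
    """
    z = x_t @ W + h_prev @ U + b
    i, f, g, o = np.split(z, 4, axis=-1)
    i = sigmoid(i)           # input gate: how much new information to store
    f = sigmoid(f)           # forget gate: how much old memory to keep
    o = sigmoid(o)           # output gate: how much memory to expose
    g = np.tanh(g)           # candidate values for the memory cell
    c_t = f * c_prev + i * g # updated memory cell
    h_t = o * np.tanh(c_t)   # output passed to the next step / next layer
    return h_t, c_t

# Run a short toy series through a layer of 4 units with random weights.
rng = np.random.default_rng(0)
input_dim, units = 1, 4
W = rng.normal(scale=0.1, size=(input_dim, 4 * units))
U = rng.normal(scale=0.1, size=(units, 4 * units))
b = np.zeros(4 * units)

h = np.zeros(units)
c = np.zeros(units)
for value in np.sin(np.linspace(0, 3, 20)):
    h, c = lstm_step(np.array([value]), h, c, W, U, b)
```

The memory cell `c` is carried from step to step, which is where the “memory” of earlier values lives; the gates decide what is written to it, kept in it, and read from it.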

Training LSTMs isn’t much different than training any NN: it uses backpropagation on a training set, with a validation set to monitor progress, and the configured hyperparameters and the layout of the layers have a large effect on performance and accuracy. For most of my work I’ve been using Keras & TensorFlow to implement time series predictions. I have some saved code for doing time series classification, but it’s a slightly different method. I found a wide variety of helpful examples early on, but they included some non-obvious pitfalls.
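As a rough illustration of what this looks like in Keras, here is a small sketch that trains an LSTM on a toy sine wave. The helper name `build_lstm`, the layer sizes, the window length and the toy data are all assumptions chosen for brevity; this is not the code from my own work or from any of the tutorials mentioned in this post.

```python
import numpy as np
import tensorflow as tf

def build_lstm(window, layer_units=(32, 32)):
    """Build a small LSTM regressor; layer_units controls the layer layout."""
    layers = [tf.keras.Input(shape=(window, 1))]
    for idx, units in enumerate(layer_units):
        is_last = idx == len(layer_units) - 1
        # Every LSTM layer except the last must return the full sequence
        # so the next LSTM layer receives 3D input.
        layers.append(tf.keras.layers.LSTM(units, return_sequences=not is_last))
    layers.append(tf.keras.layers.Dense(1))
    model = tf.keras.Sequential(layers)
    model.compile(optimizer="adam", loss="mse")
    return model

# Toy data: predict the next value of a sine wave from the previous 10.
series = np.sin(0.1 * np.arange(200, dtype="float32"))
window = 10
X = np.stack([series[i:i + window] for i in range(len(series) - window)])
X = X[..., np.newaxis]   # shape (samples, timesteps, features)
y = series[window:]

model = build_lstm(window)
history = model.fit(X, y, epochs=2, batch_size=16,
                    validation_split=0.2, verbose=0)
```

Two of the pitfalls I hit early on show up here: the input has to be shaped as (samples, timesteps, features), and stacked LSTM layers need `return_sequences=True` on every layer but the last. Note also that `validation_split` takes the last portion of the data before any shuffling, which is usually what you want for a time series.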

Dr. Jason Brownlee has a bunch of helpful introductions to various ML concepts, including LSTMs, with example data sets and code. I appreciated his discussion of the things which the tutorial example doesn’t explicitly cover, such as non-stationary data without preprocessing, model tuning, and model updates. You can check this out here:

Note: The configurations used in this example suffice to explain how LSTMs work, but the accuracy and performance aren’t good. A single layer with a small number of LSTM cells, trained for a large number of epochs, produces pretty wide swings in predicted values. You can see this by running the example several times and comparing the RMSE scores, which can be wildly off run-to-run.
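One way to quantify that run-to-run variation is to repeat the experiment with different seeds and summarize the RMSE scores. The sketch below shows the bookkeeping only. The names `rmse`, `summarize_runs` and `fake_run` are hypothetical, and `fake_run` is a stand-in that adds noise to the true values; in practice it would be replaced by a function that trains the model with a given seed and returns its test predictions.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error between two equal-length sequences."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

def summarize_runs(run_fn, y_true, n_runs=10):
    """Call run_fn(seed) n_runs times and summarize the resulting RMSEs."""
    scores = np.array([rmse(y_true, run_fn(seed)) for seed in range(n_runs)])
    return {
        "scores": scores,
        "mean": float(scores.mean()),
        "std": float(scores.std(ddof=1)),
        "min": float(scores.min()),
        "max": float(scores.max()),
    }

# Stand-in for "train with this seed, then predict the test set".
y_true = np.sin(np.linspace(0, 6, 50))

def fake_run(seed):
    rng = np.random.default_rng(seed)
    return y_true + rng.normal(scale=0.1 + 0.05 * seed, size=y_true.shape)

summary = summarize_runs(fake_run, y_true, n_runs=5)
print(summary["mean"], summary["std"], summary["min"], summary["max"])
```

A large standard deviation relative to the mean is a sign that a single reported RMSE is not telling you much about the configuration.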

Dr. Brownlee does have additional articles which go into some of the ways in which this can be improved such as his article on stacked LSTMs:

Jakob Aungiers has the best introduction to LSTMs that I have seen so far. His full article on LSTM time series prediction can be found here: while the source code (and a link to a video presentation) can be found here:

His examples are far more robust, including stacked LSTM layers, far more LSTM units per layer, and well-characterized sample data as well as more “realistic” stock data. He also uses windowing and works with non-stationary data. He has replied to a number of comments with detailed explanations, too. This guy knows his stuff.
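For readers who have not seen windowing before, here is a minimal sketch of the general idea: slice the series into overlapping fixed-length windows and use the value that follows each window as the target. The function name `make_windows` and its parameters are my own for illustration, not taken from his code.

```python
import numpy as np

def make_windows(series, window, horizon=1):
    """Turn a 1D series into (X, y) pairs for supervised training.

    X has shape (samples, window, 1), the 3D layout Keras LSTM layers expect.
    y[i] is the value `horizon` steps after the end of window i.
    """
    series = np.asarray(series, dtype="float32")
    n = len(series) - window - horizon + 1
    if n <= 0:
        raise ValueError("series is too short for this window and horizon")
    X = np.stack([series[i:i + window] for i in range(n)])
    y = np.array([series[i + window + horizon - 1] for i in range(n)])
    return X[..., np.newaxis], y

X, y = make_windows(np.arange(10), window=3)
print(X.shape, y.shape)    # (7, 3, 1) (7,)
print(X[0].ravel(), y[0])  # first window is 0, 1, 2 and its target is 3
```

When splitting windows like these into training and test sets, keep them in chronological order rather than shuffling across the split, so the model is always evaluated on data that comes after what it was trained on.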