Building AI to Make LoFi Hip Hop: an introduction to recurrent neural networks

Nina Maria Tremblay
10 min read · Feb 25, 2020

Oh, 2016.

What an absolute nightmare of a year.

Between one of the most tumultuous presidential elections in American history, the tragic deaths of so many childhood heroes, Brexit, and that one clown scare that still makes me triple-check my doors at night, it was a low point of the last decade.

But, even amidst all the turmoil, one girl still gave us hope. No matter the turmoil, the drama, or the suffering the year threw at us, she inspired us to continue our relentless march towards better times.

And I think we all know who I’m talking about.

The face of Lofi Beats To Relax Slash Study To. The video was impossible to miss. Overnight, one little hiccup in YouTube’s algorithm put this video into everybody’s Recommended box. It was like a genre was born out of thin air. And it hasn’t shown any signs of dying back down since.

The first time I listened to LoFi I could only get through ten minutes of the stuff before I knew I hated it. I thought it was overly repetitive, too unstructured, lacked direction — the musical equivalent of a Naruto filler episode.

But one day, I decided to flip it on in the background while I Relax Slash Studied. And by the time the drums came in on the very first track, I finally got what all the hype was about.

Suddenly, my bedroom wasn’t just messy. It was aesthetically disheveled. (Just like in the thumbnail!) My history textbook seemed like it covered the most interesting time period in human history. I wasn’t slogging through an all-nighter, I was having a relaxing 1 A.M. study session.

In that moment, I became the LoFi Hip Hop girl sitting by her window, Relax Slash Studying.

I’ve had LoFi playing everywhere since then. I flip on the radio during walks, on public transportation, doing dishes, studying, cleaning my room. That’s when I realized something: if I kept playing LoFi 24 hours a day, I’d run out of it eventually.

What would I do then?

I was already exploring AI (and burned through way too many LoFi playlists while I was studying). So, I decided to put my newfound skills to the test and solve the problem with a Recurrent Neural Network.

(Stick around until the end if you wanna see what the computer made.)

LoFi 101

Most LoFi is pretty simple. Each song has two main parts:

  • a jazzy instrumental
  • drums, usually with a Dilla flair

For now, our neural network is only here to generate the instrumental.

Most of these instrumentals are made up of a handful of chords that loop over and over and over again. These are called progressions. Chords in LoFi are informed by progressions from jazz, like the A-Train Changes (Imaj7-II7), a chromatic combo (#I-I), or the classic two-five-one (ii-V⁷-I). These progressions are the backbone of almost all the music we listen to today. If we don’t want our song to sound like auto-generated garbage, it’s a good idea to stick to the tried-and-true stuff.

Each of those Roman numerals represents a chord built on a different degree of the scale, and the progression tells us which chord should follow which. If we play a ii chord, a V⁷ chord should follow it. If we hear a #I chord, a I chord should come next.

To solve this problem, we can get a computer to predict the next chord in a sequence with a type of artificial intelligence called a neural network.
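
If it helps to see it, here’s a tiny sketch of what a progression looks like once it’s been turned into something a computer can read. The chord vocabulary, window size, and progression below are placeholders I made up for illustration, not the actual training data.

# A toy encoding of chord progressions as sequences a computer can read.
# The chord vocabulary and the progression here are placeholders for illustration.

CHORD_VOCAB = ["Imaj7", "ii7", "V7", "#I", "I"]
CHORD_TO_INT = {symbol: i for i, symbol in enumerate(CHORD_VOCAB)}

# The classic two-five-one, looped a few times like a LoFi sample would loop it.
progression = ["ii7", "V7", "Imaj7"] * 4

# The network only ever sees numbers, so each chord symbol becomes an index.
encoded = [CHORD_TO_INT[symbol] for symbol in progression]
print(encoded)   # [1, 2, 0, 1, 2, 0, ...]

# Training pairs: given a few chords, predict the one that comes next.
window = 3
examples = [(encoded[i:i + window], encoded[i + window])
            for i in range(len(encoded) - window)]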

Neural Nets: The Quick and Dirty Version

A neural network is a program with layers of trainable “neurons” that can be used to interpret never-before-seen data. We call them neurons because their design was inspired by the human brain.

Even though they look like this:

The brain behind our algorithm looks more like this:

Which we visualize like this:

In the diagram above, we see three layers.

Each circle represents one neuron, or node. The input layer is where data is first taken in. The hidden layer is where that data is processed. The output layer spits out the computer’s solution to our problem. If our data was a question, then the input layer would be the ears that heard it, the hidden layer would do all the thinking, and the output layer would tell us an answer. More precisely, the output node that corresponds to the correct answer would light up.

But the first few (hundred) times the algorithm runs, this won’t happen. We have to train it with ‘dummy’ data and correct it until it begins to light up the right output node.

We correct it by getting it to readjust the weights and biases associated with each node.

Every node is connected to the nodes in the next layer, and each connection has a weight that measures its strength. Weights determine the degree to which one neuron influences another. A bias is an extra number added on at the end that influences how easily that node lights up and passes information on to the next set of nodes.
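
Here’s what all of that looks like as a toy forward pass in NumPy. Every size and every random number below is made up purely for illustration; a real network would learn its weights and biases instead of pulling them out of a hat.

import numpy as np

# A toy forward pass: weights on every connection, a bias on every node,
# and whichever output node scores highest "lights up".

rng = np.random.default_rng(0)

x = rng.normal(size=3)               # input layer: three values coming in
W_hidden = rng.normal(size=(4, 3))   # weights between input and hidden layer
b_hidden = rng.normal(size=4)        # one bias per hidden node
hidden = np.tanh(W_hidden @ x + b_hidden)   # how strongly each hidden node fires

W_out = rng.normal(size=(2, 4))      # weights between hidden and output layer
b_out = rng.normal(size=2)
output = W_out @ hidden + b_out      # one score per output node

print(output.argmax())               # the output node that lights up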

Sounds nice, but this isn’t gonna make us LoFi.

Making it Actually Work with Recurrent Neural Nets

Each output node could only give us one note at a time. Or, if we programmed it right, one chord quality. But a plain network can’t quite capture all the intricacies we’re looking for in a good LoFi track. Worse, it has no memory of what came before, so it can’t follow a progression from one chord to the next.

That’s where recurrent neural networks come in.

Our hero!

Instead of only feeding inputs straight from one layer to the next, a recurrent network can use its own output from the previous time step as part of its input. This makes them perfect for interpreting sequential data, where the order of the data matters, like a LoFi sample, which is really just a sequence of chords.

On the left, we see a representation of how our code might actually work. The right gives a step-by-step representation of what the network is doing at each unit of time that passes.

The current hidden state is determined by the previous hidden state as well as the current input. A function, parameterized by the set of weights our network learns, combines both of these values to produce the result.

Alternatively,

h(t) = f_W(h(t-1), x(t))

Or, to get even more specific,

h(t) = tanh(W_hh * h(t-1) + W_xh * x(t))

Both the previous hidden state and the current input are affected by the weights of their connections with other nodes. The computer stores those weights as matrices, and each one is multiplied into its value. The two weighted terms are then added together to get a result. That result is passed through an activation function, the tanh above, which squashes it into a fixed range (between -1 and 1 for tanh) so the numbers stay manageable from one step to the next.

This gives us an output vector which we can use to find our loss at every step. Finding the total loss is as simple as adding all of the losses across all time steps together.
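
If you’d rather read the math as code, here’s a bare-bones sketch of that recurrence plus the running loss total. The sizes, the random weights, and the squared-error loss are all stand-ins; a real network learns its weights and usually uses a fancier loss.

import numpy as np

# A vanilla RNN forward pass following h(t) = tanh(W_hh * h(t-1) + W_xh * x(t)),
# plus a running total of the loss across all time steps.

rng = np.random.default_rng(0)
hidden_size, input_size = 8, 5

W_hh = rng.normal(scale=0.1, size=(hidden_size, hidden_size))   # recurrent weights
W_xh = rng.normal(scale=0.1, size=(hidden_size, input_size))    # input weights
W_hy = rng.normal(scale=0.1, size=(input_size, hidden_size))    # hidden-to-output weights

sequence = rng.normal(size=(6, input_size))   # six time steps of made-up input
targets = rng.normal(size=(6, input_size))    # what we wish the network had produced

h = np.zeros(hidden_size)
total_loss = 0.0
for x_t, y_t in zip(sequence, targets):
    h = np.tanh(W_hh @ h + W_xh @ x_t)        # new hidden state from old state + input
    y_hat = W_hy @ h                          # the output vector at this time step
    total_loss += np.sum((y_hat - y_t) ** 2)  # add this step's loss to the total

print(total_loss)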

The Big Problem With RNNs

To make our neural net more accurate, we have to make changes. But we can’t change the current input. We can’t change the previous hidden state. And it’s usually a bad idea to change the activation function. Therefore, the only parameter we can tweak in the above equation is the weights.

To change the weights, we use an algorithm called backpropagation. After the data passes through the entire network, from input to output, the backpropagation algorithm works backwards from the output to the input, adjusting the weights as it goes.

More specifically, it takes the derivative of the loss with respect to each parameter, and then changes those parameters to minimize the loss. It moves down the gradient produced by that derivative, in the negative direction, to achieve this.
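
Here’s that “move down the gradient” idea in miniature, with a single made-up weight and a made-up loss function:

# Gradient descent in miniature: one pretend weight, one pretend loss,
# and a step in the negative direction of the derivative each time.

weight = 3.0
learning_rate = 0.1

for step in range(25):
    loss = (weight - 1.0) ** 2          # pretend loss, minimized at weight = 1
    gradient = 2 * (weight - 1.0)       # derivative of the loss with respect to the weight
    weight -= learning_rate * gradient  # move down the gradient to shrink the loss

print(round(weight, 4))                 # creeps toward 1.0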

With recurrent neural networks, though, we don’t just backpropagate from output to input. We also propagate backwards from the network’s final point in time to its first. In other words, we have to go back in time! And we’re backpropagating to fix that same recurrent weight that connects all the time steps together, which can get messy.

Basically time travel

To achieve this, we multiply the weight matrix associated with each time step with the one before it. A few times is no big deal. But the more we multiply those less-than-one numbers together, the smaller our numbers get. Too many repetitions and the numbers we use to update that recurrent weight become vanishingly small. By then, the gradient from the beginning of the sequence is long gone. (Hence the name: the vanishing gradient problem.) The numbers get too small, and our network “forgets” the value.
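
You can watch the shrinking happen in a few lines of NumPy. The matrix below is arbitrary (I scale it so its largest eigenvalue is 0.5), and the only point is what repeated multiplication does to the numbers.

import numpy as np

# Watching a gradient vanish: multiply the same smallish weight matrix into
# itself over and over, the way backpropagation through time does, and the
# numbers collapse toward zero.

rng = np.random.default_rng(0)
W = rng.normal(size=(4, 4))
W *= 0.5 / np.abs(np.linalg.eigvals(W)).max()   # force the largest eigenvalue to 0.5

product = np.eye(4)
for t in range(1, 51):
    product = product @ W
    if t in (1, 10, 25, 50):
        print(t, np.abs(product).max())   # the update signal fades with every time step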

So how do we fix this?

The Big Solution Is Long Short-Term Memory

Sounds like an oxymoron, right?

The name is actually where the power comes from. Long Short-Term Memory architecture expands on the original RNN cell. The long-term memory stores the weights, while the short-term memory stores activations from recent input events, like the previous hidden state or the current input. Instead of just combining the previous cell state and the current input, and passing that whole combination through an activation function, LSTM cells allow for more control over how information is processed.

To understand how they work, it’s useful to learn more about some of the key differences between this architecture and a vanilla RNN.

It LOOKS complicated, but we’re about to break it down.

First, they have a cell state that is separate from the information that’s outputted. It’s represented here by that straight line going across the top. When we backpropagate, we’ll only have to tweak the cell state instead of doing a bunch of unnecessary multiplications by going down every single one of those paths.

That cell state will be one of the biggest factors in how accurate our neural net will be. It’s super important that it only takes in the right information. So important, in fact, that most of the LSTM architecture is based on optimally updating the cell state.

So, to make sure that it gets the right information, LSTMs use a new tool called gates. Gates control how information is added or removed before it reaches the cell state. Each gate has two parts. The first is an activation function layer, like a sigmoid or a tanh layer, which transforms the input values. Sigmoid layers are good for filtering information, since their output always falls between zero and one, while tanh layers are more often used to reshape information. Next, a pointwise multiplication operation applies the activation function layer’s output to the information flowing through the cell.

LSTMs learn new information through a three-step process.

  1. Forget. Denoted by the very first plus sign on the diagram. First, the network uses the previous cell’s output and the current input to decide which information is irrelevant and should be forgotten. These two values are added together, filtered through the sigmoid function, and the remaining information is multiplied into the cell state.
  2. Update. This is how the network can identify new information to be stored in the cell state. This portion is represented by the lines going in and out of the second and third plus signs in the diagram. Again, the previous hidden state and the current input are added together and passed through two different functions. First, the sigmoid layer decides what values should be updated, and then the tanh gate figures out how to do so by generating a vector of possible values that could be added to the state. These two values are multiplied and added to the cell state.
  3. Output. Finally, we use information from the cell state to decide what to output from the previous hidden state and the current input. A sigmoid function filters the input, and a tanh layer maps the cell state information between -1 and 1. The product of these two layers then becomes the output, and it’s fed back into the layer as our next hidden state. (All three steps are sketched in code just below.)
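
Put together, one LSTM step looks roughly like this when written out by hand. The weight matrices are random placeholders, the bias terms are left out to keep it short, and a trained network would have learned all of these values on its own.

import numpy as np

# One LSTM step, matching the forget / update / output story above.

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
hidden_size, input_size = 4, 3

def make_weights():   # each gate gets its own weights over [previous hidden, input]
    return rng.normal(scale=0.1, size=(hidden_size, hidden_size + input_size))

W_forget, W_input, W_candidate, W_output = (make_weights() for _ in range(4))

h_prev = np.zeros(hidden_size)      # previous hidden state
c_prev = np.zeros(hidden_size)      # previous cell state
x_t = rng.normal(size=input_size)   # current input

z = np.concatenate([h_prev, x_t])   # previous hidden state and current input, combined

f = sigmoid(W_forget @ z)           # 1. forget: how much of the old cell state to keep
i = sigmoid(W_input @ z)            # 2. update: which values to write...
c_tilde = np.tanh(W_candidate @ z)  #    ...and the candidate values to write
c_t = f * c_prev + i * c_tilde      #    the new cell state

o = sigmoid(W_output @ z)           # 3. output: what to reveal from the cell state
h_t = o * np.tanh(c_t)              #    the new hidden state, fed to the next step

print(h_t)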

Luckily, Keras handles all of the backend in our code, so we don’t have to do any of this by hand for now.
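
Just for reference, a minimal Keras setup for next-chord prediction might look something like the sketch below. The vocabulary size, sequence length, and layer sizes are assumptions on my part, not the exact values from this project.

import numpy as np
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dense

# A minimal next-chord predictor. All sizes below are assumed for illustration.
vocab_size = 32        # distinct chords in the training data (assumed)
sequence_length = 16   # how many past chords the network sees at once (assumed)

model = Sequential([
    LSTM(128, input_shape=(sequence_length, 1)),   # the LSTM does the heavy lifting
    Dense(vocab_size, activation="softmax"),       # one output node per possible next chord
])
model.compile(loss="categorical_crossentropy", optimizer="adam")

# Dummy arrays with the right shapes, just to show the training call.
X = np.random.rand(200, sequence_length, 1)
y = np.eye(vocab_size)[np.random.randint(vocab_size, size=200)]
model.fit(X, y, epochs=1, batch_size=32)

From there, one way to generate a progression is a simple loop: feed in the last few chords, sample the next one from the softmax output, append it, and repeat.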

Aaaaand here’s our song!

Congrats! You made a chord progression!

Now, all that’s left is to have the computer play it over our drum sample. (Thankfully, we don’t need a whole other neural net for this.)
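
If you’re wondering how the generated chords turn into something you can actually hear, here’s one way to write them out as a MIDI file with music21, loosely in the spirit of Sigurður Skúli’s guide. The chord spellings below are placeholders standing in for whatever the network generated; layering the drum sample on top afterwards is just an audio-editing job.

from music21 import chord, stream

# Write a generated progression out as MIDI. These chords are placeholders.
generated = [["D4", "F4", "A4", "C5"],   # ii7
             ["G3", "B3", "D4", "F4"],   # V7
             ["C4", "E4", "G4", "B4"]]   # Imaj7

song = stream.Stream()
for pitches in generated * 4:            # loop the progression, LoFi style
    c = chord.Chord(pitches)
    c.quarterLength = 4                  # hold each chord for a whole bar
    song.append(c)

song.write("midi", fp="lofi_progression.mid")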

Here’s the finished result:

And here’s the code to do it, helped along with this AMAZING guide by Sigurður Skúli.

Hi! I’m Nina. I write about a lot of different things, including AI, climate change, and living a more wholesome life. If any of those things interest you, subscribe to my newsletter or check out my website to keep in touch.
