Neural Networks and Deep Learning

Summer of Science 2023

Harsh S Roniyar
Indian Institute of Technology Bombay

Abstract

Neural networks and deep learning have revolutionized the field of artificial intelligence and have become indispensable tools in various domains. This report provides an overview of the fundamental concepts, advancements, and applications of neural networks and deep learning.

The report begins with an introduction to neural networks, explaining their structure, operation, and key components, such as neurons, layers, and activation functions. It then examines the training process of neural networks and discusses important techniques such as gradient descent, forward and back-propagation, and vectorization. Finally, it goes on to implement Shallow and Deep Neural Networks.

My learning so far is based on Andrew Ng's Neural Networks and Deep Learning course and on the book Deep Learning by Goodfellow et al.

Introduction

A neural network is a computational model inspired by the structure and function of the human brain. It is a type of machine learning algorithm that is designed to recognize patterns and make predictions based on input data. Neural networks consist of interconnected nodes, called neurons, organized into layers.

Multi-layer Perceptron (Building block of Neural Networks)

The basic building block of a neural network is the artificial neuron, also known as a perceptron. Each neuron takes in one or more inputs, applies weights to those inputs, sums them up, and then applies an activation function to produce an output. The activation function determines whether the neuron should “fire” and pass its output to the next layer of neurons.
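As a minimal sketch, a single neuron with a step activation might look as follows in Python (the weights, bias, and threshold here are illustrative choices, not taken from the report):

import numpy as np

def perceptron(x, w, b):
    # Weighted sum of the inputs plus a bias, followed by a step activation
    z = np.dot(w, x) + b
    return 1 if z > 0 else 0  # the neuron "fires" only if z crosses the threshold

# Illustrative example with two inputs and hand-picked weights
print(perceptron(np.array([1.0, 0.5]), np.array([0.6, -0.2]), -0.3))  # prints 1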

The broad paradigms in Neural Networks (and in Machine Learning in general) are:

Supervised Learning

In supervised learning, the algorithm learns from labeled examples or training data: each input is accompanied by a corresponding target label or desired output. The goal is to train a model that learns the mapping between the input data and the target labels, enabling it to make accurate predictions on new, unseen data.

Unsupervised Learning

In unsupervised learning, the algorithm learns patterns and structures in the data without any explicit labels or target outputs. Unlike in supervised learning, the input data is unlabeled, and the algorithm explores the data to find inherent patterns, relationships, or groupings.

Reinforcement (Sequential) Learning

Reinforcement learning is a type of machine learning where an agent learns to make sequential decisions in an environment to maximize a cumulative reward signal. The agent interacts with the environment, takes actions, receives feedback in the form of rewards or penalties, and learns the optimal behaviour to achieve long-term goals.

Unsupervised vs. Supervised vs. Reinforcement Learning

In the subsequent sections, we will only be discussing ‘Supervised’ Learning with Neural Networks.

Some examples of Neural Networks:

  1. Standard NN

  2. Convolutional NN

  3. Recurrent NN

Figure: Examples of Neural Networks

Basics of Neural Network Programming

Logistic Regression as a Neural Network

Binary Classification

As the name suggests, binary classification is a supervised learning task that categorizes observations into two different classes (typically 0 and 1). So, when we want to model binary classification, we need the model's outputs to lie between 0 and 1.

Logistic Regression

Logistic regression is a learning algorithm used in supervised learning problems where the output labels \(y\) are all either zero or one. The goal of logistic regression is to minimize the error between its predictions and the training data. Logistic regression is modeled using the sigmoid curve, which gives outputs between 0 and 1 that represent the probability of the label being 1, i.e. \(\hat{y} = P(y = 1 \mid x)\), where \(x\) is our input, \(\hat{y}\) is the predicted value, and \(0 \leq \hat{y} \leq 1\).

The parameters used in Logistic Regression are the weights \(w \in \mathbb{R}^{n_x}\) and the bias \(b \in \mathbb{R}\), where \(n_x\) is the number of input features [1].

Sigmoid Function

\[\sigma(z) = \frac{1}{1 + e^{-z}}\]

The sigmoid maps any real input into the interval \((0, 1)\), which is what allows its output to be interpreted as a probability.

Loss (Error) and Cost Functions

The loss function computes the error for a single training example (the discrepancy between \(\hat{y}^{(i)}\) and \(y^{(i)}\)), whereas the cost function is the average of the loss function over the entire training set.

Loss Function:

\(L\left(\hat{y}^{(i)}, y^{(i)}\right)=-\left(y^{(i)} \log \left(\hat{y}^{(i)}\right)+\left(1-y^{(i)}\right) \log \left(1-\hat{y}^{(i)}\right)\right)\)

Cost Function:

\[J(w, b)=\frac{1}{m} \sum_{i=1}^{m} L\left(\hat{y}^{(i)}, y^{(i)}\right)=-\frac{1}{m} \sum_{i=1}^{m}\left[y^{(i)} \log \left(\hat{y}^{(i)}\right)+\left(1-y^{(i)}\right) \log \left(1-\hat{y}^{(i)}\right)\right]\]

A good point to note here is that by choosing \(J(w, b)\) as defined above, we have made it a convex function, thus eliminating the possibility of multiple local optima. It is worth reminding ourselves of our goal at this point, which is to find the values of \(w\) and \(b\) that minimize \(J(w, b)\).

Gradient Descent

Gradient Descent is an optimization algorithm commonly used in neural networks for updating the weights and biases during the training process.

In a neural network, the goal is to minimize a cost or loss function that measures the discrepancy between the predicted output and the actual output. Gradient Descent iteratively adjusts the network’s parameters (weights and biases) in the direction of steepest descent of the cost function to find its minimum.

The general update rule for Gradient Descent in a neural network can be expressed as follows:

\[\theta_{t+1} = \theta_t - \alpha \cdot \nabla J(\theta_t)\]

where \(\theta_t\) denotes the parameters (weights and biases) at iteration \(t\), \(\alpha\) is the learning rate, and \(\nabla J(\theta_t)\) is the gradient of the cost function with respect to the parameters.

To compute the gradient \(\nabla J(\theta_t)\) for the parameters \(\theta_t\), we typically use the backpropagation algorithm, which calculates the gradient recursively layer by layer, starting from the output layer and propagating the errors backward through the network.

Once the gradient is computed, we multiply it by the learning rate \(\alpha\) and subtract the result from the current parameter values \(\theta_t\) to obtain the updated parameter values \(\theta_{t+1}\).

The process of iteratively applying this update rule continues until convergence or until a predefined number of iterations is reached. At convergence, the parameters ideally reach a point where the cost function is minimized, resulting in a well-trained neural network (as seen in the Graph at points A and B).

Figure: Gradient Descent Optimization
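To make the update rule concrete, here is a minimal sketch in Python with NumPy; the quadratic cost and its gradient are toy placeholders chosen for illustration, not part of any particular network.

import numpy as np

def gradient_descent(grad_J, theta0, alpha=0.1, num_iters=100):
    # Iteratively apply: theta <- theta - alpha * grad_J(theta)
    theta = theta0.copy()
    for _ in range(num_iters):
        theta = theta - alpha * grad_J(theta)
    return theta

# Toy cost J(theta) = theta_1^2 + theta_2^2, whose gradient is 2 * theta
theta_opt = gradient_descent(lambda t: 2 * t, np.array([3.0, -4.0]))
print(theta_opt)  # converges towards the minimum at [0, 0]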

Computation Graph

A computation graph is a visual representation of the computations performed in a neural network. It breaks down complex mathematical operations into smaller, interconnected nodes, allowing for efficient computation and automatic differentiation.

In a neural network, computations are typically organized into layers consisting of nodes or neurons. Each neuron performs a computation by applying a linear transformation followed by a non-linear activation function.

Consider a simple neural network with two input features \(x_1\) and \(x_2\) and a single output \(y\). The computation graph for this network can be represented as follows: \[\begin{aligned} z &= w_1x_1 + w_2x_2 + b \\ a &= \sigma(z) \\ y &= f(a) \end{aligned}\]

where \(z\) is the linear combination (pre-activation) of the inputs, \(w_1\) and \(w_2\) are the weights, \(b\) is the bias, \(\sigma\) is the activation function [6], and \(f\) maps the activation \(a\) to the final output \(y\).

The computation graph provides a clear and structured representation of the computations performed in the neural network. It allows for efficient forward propagation, where inputs flow through the graph to produce outputs, and also facilitates backward propagation, which is used for computing gradients during the training process.


Forward and Back Propagation

During forward propagation, input values are assigned to the corresponding nodes in the graph, and computations are performed layer by layer until the final output is obtained. During backward propagation (also known as backpropagation), gradients are computed by recursively applying the chain rule to propagate the gradients from the output layer to the input layer. These gradients are then used to update the weights and biases through optimization algorithms such as Gradient Descent, as explained in the previous section.

An example to help visualise this is given below [2].

Figure: An example of a Computation Graph

The computation graph plays a crucial role in neural networks by providing a structured representation of the computations and enabling efficient and accurate training through automatic differentiation.
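As a small illustration, the following sketch runs one forward and one backward pass through the two-input graph above, taking \(\sigma\) to be the sigmoid and \(f\) to be the identity (both choices are assumptions made for this example):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Forward pass: inputs flow through the graph z -> a -> y
x1, x2 = 1.0, 2.0
w1, w2, b = 0.5, -0.3, 0.1
z = w1 * x1 + w2 * x2 + b
a = sigmoid(z)
y = a  # f is taken to be the identity here

# Backward pass: the chain rule propagates gradients node by node
da = 1.0                   # dy/da = 1 for the identity f
dz = da * a * (1.0 - a)    # sigmoid'(z) = sigmoid(z) * (1 - sigmoid(z))
dw1, dw2, db = dz * x1, dz * x2, dz  # dz/dw1 = x1, dz/dw2 = x2, dz/db = 1
print(dw1, dw2, db)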

Vectorization

In deep learning, we deal with very large datasets. Hence, a non-computationally-optimal function can become a huge bottleneck in our algorithm and can result in a model that takes ages to run.

To make sure that our code is computationally efficient, we will use vectorization.

An example demonstrating the difference between classical and vectorized implementations of the dot, outer, and elementwise products is given on the following pages.

Classical Implementation

import time
import numpy as np

x1 = [9, 2, 5, 0, 0, 7, 5, 0, 0, 0, 9, 2, 5, 0, 0]
x2 = [9, 2, 2, 9, 0, 9, 2, 5, 0, 0, 9, 2, 5, 0, 0]

### CLASSIC DOT PRODUCT OF VECTORS IMPLEMENTATION ###
tic = time.process_time()
dot = 0

for i in range(len(x1)):
    dot += x1[i] * x2[i]
toc = time.process_time()
print ("dot = " + str(dot) + "\n ----- Computation time = " + str(1000 * (toc - tic)) + "ms")

### CLASSIC OUTER PRODUCT IMPLEMENTATION ###
tic = time.process_time()
outer = np.zeros((len(x1), len(x2))) # we create a len(x1)*len(x2) matrix with only zeros

for i in range(len(x1)):
    for j in range(len(x2)):
        outer[i,j] = x1[i] * x2[j]
toc = time.process_time()
print ("outer = " + str(outer) + "\n ----- Computation time = " + str(1000 * (toc - tic)) + "ms")

### CLASSIC ELEMENTWISE IMPLEMENTATION ###
tic = time.process_time()
mul = np.zeros(len(x1))

for i in range(len(x1)):
    mul[i] = x1[i] * x2[i]
toc = time.process_time()
print ("elementwise multiplication = " + str(mul) + "\n ----- Computation time = " + str(1000 * (toc - tic)) + "ms")

### CLASSIC GENERAL DOT PRODUCT IMPLEMENTATION ###
W = np.random.rand(3,len(x1)) # Random 3*len(x1) numpy array
tic = time.process_time()
gdot = np.zeros(W.shape[0])

for i in range(W.shape[0]):
    for j in range(len(x1)):
        gdot[i] += W[i,j] * x1[j]
toc = time.process_time()
print ("gdot = " + str(gdot) + "\n ----- Computation time = " + str(1000 * (toc - tic)) + "ms")

Output:

dot = 278
 ----- Computation time = 0.1567150000001405ms
outer = [[81. 18. 18. 81.  0. 81. 18. 45.  0.  0. 81. 18. 45.  0.  0.]
 [18.  4.  4. 18.  0. 18.  4. 10.  0.  0. 18.  4. 10.  0.  0.]
 [45. 10. 10. 45.  0. 45. 10. 25.  0.  0. 45. 10. 25.  0.  0.]
 [ 0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.]
 [ 0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.]
 [63. 14. 14. 63.  0. 63. 14. 35.  0.  0. 63. 14. 35.  0.  0.]
 [45. 10. 10. 45.  0. 45. 10. 25.  0.  0. 45. 10. 25.  0.  0.]
 [ 0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.]
 [ 0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.]
 [ 0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.]
 [81. 18. 18. 81.  0. 81. 18. 45.  0.  0. 81. 18. 45.  0.  0.]
 [18.  4.  4. 18.  0. 18.  4. 10.  0.  0. 18.  4. 10.  0.  0.]
 [45. 10. 10. 45.  0. 45. 10. 25.  0.  0. 45. 10. 25.  0.  0.]
 [ 0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.]
 [ 0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.]]
 ----- Computation time = 0.24711300000035408ms
elementwise multiplication = [81.  4. 10.  0.  0. 63. 10.  0.  0.  0. 81.  4. 25.  0.  0.]
 ----- Computation time = 0.15952099999960723ms
gdot = [33.67290389 24.88257566 17.1213738 ]
 ----- Computation time = 0.23116099999986872ms

Vectorised Implementation

import time
import numpy as np

x1 = [9, 2, 5, 0, 0, 7, 5, 0, 0, 0, 9, 2, 5, 0, 0]
x2 = [9, 2, 2, 9, 0, 9, 2, 5, 0, 0, 9, 2, 5, 0, 0]
W = np.random.rand(3, len(x1))  # random 3*len(x1) numpy array, as in the classical version

### VECTORIZED DOT PRODUCT OF VECTORS ###
tic = time.process_time()
dot = np.dot(x1,x2)
toc = time.process_time()
print ("dot = " + str(dot) + "\n ----- Computation time = " + str(1000 * (toc - tic)) + "ms")

### VECTORIZED OUTER PRODUCT ###
tic = time.process_time()
outer = np.outer(x1,x2)
toc = time.process_time()
print ("outer = " + str(outer) + "\n ----- Computation time = " + str(1000 * (toc - tic)) + "ms")

### VECTORIZED ELEMENTWISE MULTIPLICATION ###
tic = time.process_time()
mul = np.multiply(x1,x2)
toc = time.process_time()
print ("elementwise multiplication = " + str(mul) + "\n ----- Computation time = " + str(1000*(toc - tic)) + "ms")

### VECTORIZED GENERAL DOT PRODUCT ###
tic = time.process_time()
dot = np.dot(W,x1)
toc = time.process_time()
print ("gdot = " + str(dot) + "\n ----- Computation time = " + str(1000 * (toc - tic)) + "ms")

Output:

dot = 278
 ----- Computation time = 0.09885899999995118ms
outer = [[81 18 18 81  0 81 18 45  0  0 81 18 45  0  0]
 [18  4  4 18  0 18  4 10  0  0 18  4 10  0  0]
 [45 10 10 45  0 45 10 25  0  0 45 10 25  0  0]
 [ 0  0  0  0  0  0  0  0  0  0  0  0  0  0  0]
 [ 0  0  0  0  0  0  0  0  0  0  0  0  0  0  0]
 [63 14 14 63  0 63 14 35  0  0 63 14 35  0  0]
 [45 10 10 45  0 45 10 25  0  0 45 10 25  0  0]
 [ 0  0  0  0  0  0  0  0  0  0  0  0  0  0  0]
 [ 0  0  0  0  0  0  0  0  0  0  0  0  0  0  0]
 [ 0  0  0  0  0  0  0  0  0  0  0  0  0  0  0]
 [81 18 18 81  0 81 18 45  0  0 81 18 45  0  0]
 [18  4  4 18  0 18  4 10  0  0 18  4 10  0  0]
 [45 10 10 45  0 45 10 25  0  0 45 10 25  0  0]
 [ 0  0  0  0  0  0  0  0  0  0  0  0  0  0  0]
 [ 0  0  0  0  0  0  0  0  0  0  0  0  0  0  0]]
 ----- Computation time = 0.1062459999998655ms
elementwise multiplication = [81  4 10  0  0 63 10  0  0  0 81  4 25  0  0]
 ----- Computation time = 0.06695800000011687ms
gdot = [26.70284694 22.76463722 19.81081097]
 ----- Computation time = 0.20383000000001594ms
 

The vectorized implementation is much cleaner and more efficient. For bigger vectors/matrices, the differences in running time become even larger [3].

Implementing Logistic Regression

After going through all the basic components of a neural network, we can combine them in order to create our (basic) neural network model.
The main steps for building a Neural Network are:

  1. Define the model structure (such as number of input features)

  2. Initialize the model’s parameters

  3. Loop:

    1. Calculate current loss (forward propagation)

    2. Calculate current gradient (backward propagation)

    3. Update parameters (gradient descent)

The following equations will help us to implement our model over \(m\) training examples [4]: \[A = \sigma(w^T X + b) = (a^{(1)}, a^{(2)}, ..., a^{(m-1)}, a^{(m)})\] \[J = -\frac{1}{m}\sum_{i=1}^{m}(y^{(i)}\log(a^{(i)})+(1-y^{(i)})\log(1-a^{(i)}))\] \[\frac{\partial J}{\partial w} = \frac{1}{m}X(A-Y)^T\] \[\frac{\partial J}{\partial b} = \frac{1}{m} \sum_{i=1}^m (a^{(i)}-y^{(i)})\] where \(X\) is the \((n_x, m)\) matrix of stacked training inputs, \(Y = (y^{(1)}, \dots, y^{(m)})\) is the row vector of labels, and \(A\) is the row vector of activations (predictions).
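A hedged sketch of how these equations translate into vectorized NumPy; the function and variable names below are my own, chosen for illustration:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def propagate(w, b, X, Y):
    # w: (n_x, 1) weights, b: scalar bias, X: (n_x, m) inputs, Y: (1, m) labels
    m = X.shape[1]
    A = sigmoid(np.dot(w.T, X) + b)                 # A = sigma(w^T X + b)
    cost = -np.sum(Y * np.log(A) + (1 - Y) * np.log(1 - A)) / m
    dw = np.dot(X, (A - Y).T) / m                   # dJ/dw
    db = np.sum(A - Y) / m                          # dJ/db
    return dw, db, cost

In a training loop, each iteration would call this function and then update \(w\) and \(b\) with the gradient descent step described earlier.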

Implementation of Neural Networks

\(\mathbf{Reminder:}\) The general methodology to build a Neural Network is to:

  1. Define the neural network structure ( # of input units, # of hidden units, etc).

  2. Initialize the model’s parameters

  3. Loop:

    1. Implement forward propagation.

    2. Compute the current loss.

    3. Implement backward propagation to obtain the gradients.

    4. Update the parameters (gradient descent).

Shallow Neural Networks

Shallow Neural Networks (also known as single-layer neural networks) are a type of neural network with only one hidden layer between the input and output layers. These networks are considered “shallow” because they have a limited number of layers compared to deep neural networks, which have multiple hidden layers.

Defining the structure

Define three variables: \(n_x\), the size of the input layer; \(n_h\), the size of the hidden layer; and \(n_y\), the size of the output layer.

Initializing the parameters

Initialise the following parameters with random values: the weight matrices \(W^{[1]}\) of shape \((n_h, n_x)\) and \(W^{[2]}\) of shape \((n_y, n_h)\), and the bias vectors \(b^{[1]}\) of shape \((n_h, 1)\) and \(b^{[2]}\) of shape \((n_y, 1)\) (in practice the weights are initialised to small random values and the biases to zeros).

Forward Propagation

Next, we move on to implement forward propagation using the following equations [5]: \[Z^{[1]} = W^{[1]} X + b^{[1]}\] \[A^{[1]} = g^{[1]}(Z^{[1]})\] \[Z^{[2]} = W^{[2]} A^{[1]} + b^{[2]}\] \[\hat{Y} = A^{[2]} = g^{[2]}(Z^{[2]})\]
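As a sketch of these four equations (assuming \(g^{[1]} = \tanh\) and \(g^{[2]} = \sigma\), a common choice; the dictionary layout is my own):

import numpy as np

def forward_propagation(X, params):
    # X: (n_x, m); params holds W1 (n_h, n_x), b1 (n_h, 1), W2 (n_y, n_h), b2 (n_y, 1)
    Z1 = np.dot(params["W1"], X) + params["b1"]
    A1 = np.tanh(Z1)                                  # g[1] = tanh
    Z2 = np.dot(params["W2"], A1) + params["b2"]
    A2 = 1.0 / (1.0 + np.exp(-Z2))                    # g[2] = sigmoid, so A2 = Y_hat
    cache = {"Z1": Z1, "A1": A1, "Z2": Z2, "A2": A2}  # saved for back propagation
    return A2, cache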

Compute the Cost

Now that we’ve computed \(A^{[2]}\) which contains \(a^{[2](i)}\) for all examples, we can compute the cost function as follows:

\[J = - \frac{1}{m} \sum\limits_{i = 1}^{m} \large{(} \small y^{(i)}\log\left(a^{[2] (i)}\right) + (1-y^{(i)})\log\left(1- a^{[2] (i)}\right) \large{)} \small\]

Backward Propagation

\[dZ^{[2]} = A^{[2]} - Y\] \[dW^{[2]} = \frac{1}{m}dZ^{[2]}A^{[1]T}\] \[db^{[2]} = \frac{1}{m}\,\text{np.sum}(dZ^{[2]}, \text{axis}=1, \text{keepdims}=\text{True})\] \[dZ^{[1]} = W^{[2]T}dZ^{[2]} * g^{[1]\prime}(Z^{[1]})\] \[dW^{[1]} = \frac{1}{m}dZ^{[1]}X^T\] \[db^{[1]} = \frac{1}{m}\,\text{np.sum}(dZ^{[1]}, \text{axis}=1, \text{keepdims}=\text{True})\]
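Continuing the sketch from the forward pass above (so, with tanh in the hidden layer, \(g^{[1]\prime}(Z^{[1]}) = 1 - A^{[1]2}\)), these six equations become:

import numpy as np

def backward_propagation(params, cache, X, Y):
    m = X.shape[1]
    A1, A2 = cache["A1"], cache["A2"]
    dZ2 = A2 - Y
    dW2 = np.dot(dZ2, A1.T) / m
    db2 = np.sum(dZ2, axis=1, keepdims=True) / m
    dZ1 = np.dot(params["W2"].T, dZ2) * (1.0 - A1 ** 2)  # tanh'(Z1) = 1 - A1^2
    dW1 = np.dot(dZ1, X.T) / m
    db1 = np.sum(dZ1, axis=1, keepdims=True) / m
    return {"dW1": dW1, "db1": db1, "dW2": dW2, "db2": db2}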

Update Parameters

Now, let’s use the gradients from the previous step in gradient descent to update the parameters \((W^{[1]}, b^{[1]}, W^{[2]}, b^{[2]})\) \[\theta = \theta - \alpha\frac{\partial J}{\partial\theta}\] where, \(\alpha\) is the learning rate and \(\theta\) represents a parameter.
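In code, the update is one line per parameter; a sketch continuing the example above (the learning rate value is arbitrary here):

def update_parameters(params, grads, alpha=0.01):
    # theta <- theta - alpha * dJ/dtheta, applied to each of the four parameters
    for key in ("W1", "b1", "W2", "b2"):
        params[key] = params[key] - alpha * grads["d" + key]
    return params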

Now, we have finally completed building a Neural Network with all the steps mentioned at the start of the section.

In summary, Shallow Neural Networks have a single hidden layer between the input and output layers, making them relatively simpler compared to deep neural networks. Despite their limitations, they can still perform well on certain tasks and serve as a starting point for understanding more complex neural network architectures.

Deep Neural Networks

Deep Neural Networks (DNNs) are a type of neural network with multiple hidden layers between the input and output layers. These networks are called “deep” because they have a greater depth due to the presence of multiple layers, allowing them to learn increasingly complex and abstract representations from the input data.

Initializing the parameters for an L-layer Neural Network

Initialise the following parameters with random values: for each layer \(l = 1, \dots, L\), a weight matrix \(W^{[l]}\) of shape \((n_l, n_{l-1})\) and a bias vector \(b^{[l]}\) of shape \((n_l, 1)\) [7].

Forward Propagation Module

Forward propagation will be implemented for each hidden layer with their corresponding activation functions using the following equations: \[Z^{[l]} = W^{[l]}A^{[l-1]} +b^{[l]}\] \[A^{[l]} = g^{[l]}(Z^{[l]}) = g^{[l]}(W^{[l]}A^{[l-1]} +b^{[l]})\] where, \(A^{[0]} = X\).

A point to note: when we implement forward propagation in code, it is good practice to store the intermediate values in a cache, so that we can directly reuse those values while implementing back propagation for the same neural network.
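A hedged sketch of one layer's forward step with such a cache (the tuple layout and names are illustrative; ReLU and sigmoid are the activations assumed here):

import numpy as np

def linear_activation_forward(A_prev, W, b, activation="relu"):
    Z = np.dot(W, A_prev) + b                 # Z[l] = W[l] A[l-1] + b[l]
    if activation == "relu":
        A = np.maximum(0, Z)                  # hidden layers
    else:
        A = 1.0 / (1.0 + np.exp(-Z))          # sigmoid, typically the output layer
    cache = (A_prev, W, b, Z)                 # stored for back propagation
    return A, cache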

Compute the (Cross-Entropy) Cost

\[J = - \frac{1}{m} \sum\limits_{i = 1}^{m} \large{(} \small y^{(i)}\log\left(a^{[L] (i)}\right) + (1-y^{(i)})\log\left(1- a^{[L] (i)}\right) \large{)} \small\]

Backward Propagation Module

\[dW^{[l]} = \frac{\partial \mathcal{J} }{\partial W^{[l]}} = \frac{1}{m} dZ^{[l]} A^{[l-1] T}\] \[db^{[l]} = \frac{\partial \mathcal{J} }{\partial b^{[l]}} = \frac{1}{m} \sum_{i = 1}^{m} dZ^{[l](i)}\] \[dA^{[l-1]} = \frac{\partial \mathcal{L} }{\partial A^{[l-1]}} = W^{[l] T} dZ^{[l]}\] \[dZ^{[l]} = dA^{[l]} * g^{[l]'}(Z^{[l]}).\]
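Given \(dZ^{[l]}\) and the cached values from the forward pass, the three remaining gradients take a few lines each; a sketch under the same assumed cache layout as above:

import numpy as np

def linear_backward(dZ, cache):
    A_prev, W, b, Z = cache
    m = A_prev.shape[1]
    dW = np.dot(dZ, A_prev.T) / m                 # dJ/dW[l]
    db = np.sum(dZ, axis=1, keepdims=True) / m    # dJ/db[l]
    dA_prev = np.dot(W.T, dZ)                     # dJ/dA[l-1], passed to layer l-1
    return dA_prev, dW, db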

Update Parameters

Now, let’s update the parameters of the model, using gradient descent: \[W^{[l]} = W^{[l]} - \alpha \text{ } dW^{[l]}\] \[b^{[l]} = b^{[l]} - \alpha \text{ } db^{[l]}\] where, \(\alpha\) is the learning rate.

With this, we have completed all crucial steps in building a Deep Neural Network.

Deep Neural Networks have shown significant success in various fields, such as image recognition, natural language processing, and reinforcement learning. The additional layers in deep networks allow them to learn hierarchical and abstract features, making them more adept at handling complex and high-dimensional data.

In summary, Deep Neural Networks have multiple hidden layers, making them capable of learning intricate representations from data and achieving state-of-the-art performance in various machine learning tasks.


Regularization

Deep Learning models have so much flexibility and capacity that overfitting can be a serious problem if the training dataset is not big enough. A model may do well on the training set, yet the learned network fails to generalize to new examples that it has never seen!

This is where regularization comes in: it reduces overfitting by adding a penalty that discourages overly complex models [8].

L2 Regularization The standard way to reduce overfitting is called L2 regularization. It consists of appropriately modifying the cost function, from:

\[J = -\frac{1}{m} \sum\limits_{i = 1}^{m} \large{(}\small y^{(i)}\log\left(a^{[L](i)}\right) + (1-y^{(i)})\log\left(1- a^{[L](i)}\right) \large{)}\] To: \[J_{regularized} = \small \underbrace{-\frac{1}{m} \sum\limits_{i = 1}^{m} \large{(}\small y^{(i)}\log\left(a^{[L](i)}\right) + (1-y^{(i)})\log\left(1- a^{[L](i)}\right) \large{)} }_\text{cross-entropy cost} + \underbrace{\frac{\lambda}{2m} \sum\limits_l\sum\limits_k\sum\limits_j W_{k,j}^{[l]2} }_\text{L2 regularization cost}\] where \(\lambda\) is the regularization parameter [9].

Logic behind L2 Regularization Regularization relies on the assumption that a model with small weights is simpler than a model with large weights. Thus, by penalizing the squared values of the weights in the cost function, we drive all the weights towards smaller values: it becomes too costly for the model to have large weights. This leads to a smoother model in which the output changes more slowly as the input changes [10].
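A minimal sketch of the regularized cost in NumPy (the function name and argument layout are my own assumptions for illustration):

import numpy as np

def compute_cost_with_l2(AL, Y, weight_matrices, lambd):
    # Cross-entropy cost plus (lambda / 2m) * sum of squared weights over all layers
    m = Y.shape[1]
    cross_entropy = -np.sum(Y * np.log(AL) + (1 - Y) * np.log(1 - AL)) / m
    l2_term = (lambd / (2 * m)) * sum(np.sum(np.square(W)) for W in weight_matrices)
    return cross_entropy + l2_term

Note that the backward pass must change consistently: each \(dW^{[l]}\) picks up an extra \(\frac{\lambda}{m} W^{[l]}\) term from the penalty.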

Learning Outcomes

As the project comes to an end, I have achieved the following Learning Outcomes:

  1. Understanding the basic principles and mathematics underlying Neural Nets.

  2. Successfully implementing a Shallow Neural Network from scratch using only the basic libraries numpy and matplotlib.

  3. Successfully implementing a Deep Neural Network from scratch using only the basic libraries numpy and matplotlib.

  4. Applying regularization in Deep Learning Models.


Concluding Remarks

The depth to which we can proceed in the realm of Deep Learning is truly fascinating, and it surely holds an exciting future. But for the purpose of this project, I eventually need to restrict myself somewhere due to deadlines and time constraints.

Doing this project was truly an enjoyable process. I would like to thank the MnP club for this wonderful opportunity, and I look forward to participating in SoS next year as well. I would also like to thank my mentor for the feedback, guidance, and resources provided to me.

It was an amazing experience.

Thank you!


  1. Input features vector for an image.
    Images of size \((num_{px}, num_{px}, 3)\), where \(3\) represents the \(3\) RGB channels, are flattened into single vectors of shape \((num_{px}*num_{px}*3, 1)\).
    Here, \(n_x = num_{px}*num_{px}*3\).

  2. A wonderful description of the same can be found here: https://colah.github.io/posts/2015-08-Backprop/

  3. Note that np.dot() performs a matrix-matrix or matrix-vector multiplication. This is different from np.multiply() and the * operator (which is equivalent to .* in Matlab/Octave), which perform an element-wise multiplication.

  4. Note: In cases involving arithmetic operations, broadcasting in Python using NumPy arrays vectorizes the operation.

  5. \(g^{[i]}(\cdot)\) is the activation function for the \(i^{th}\) layer.

  6. Common activation functions include the sigmoid function, the rectified linear unit (ReLU), and the hyperbolic tangent (tanh) function.

  7. \(n_{l}\) represents the dimension of the \(l^{th}\) layer.

  8. For the purpose of this report, I will only be discussing ‘L2 Regularization’ and will use the term ‘Regularization’ interchangeably.

  9. Note: The value of \(\lambda\) is a hyperparameter (hence it needs to be tuned). L2 regularization makes the decision boundary smoother. If \(\lambda\) is too large, it is also possible to “oversmooth”, resulting in a model with high bias.

  10. Note that regularization hurts training set performance! This is because it limits the ability of the network to overfit to the training set. But since it ultimately gives better test accuracy, it is helping your system.