Problem 15: The Forward Pass of a Neural Network

Version 1.2

Neural networks are a set of algorithms for pattern recognition. They are loosely inspired by the human brain, and have proven especially useful in data clustering and classification tasks.

In this problem, you will use your knowledge of Python and Numpy to speed up a common computational kernel that arises in neural networks.

Example: a "fully connected layer" for images. Suppose the data we are analyzing consists of $N$ two-dimensional images of size $H \times W$ pixels each. In a neural network, a typical substep is the evaluation of a "fully connected (FC) layer," which takes the images as input and produces a vector of outputs, with one vector per image.

Mathematically, here is a simplified example of what a typical FC layer calculation might look like. Let $x[k, i, j]$ denote the value (e.g., intensity) of the pixel at location $(i, j)$ of the $k$-th input image. Since there are $N$ images, take $0 \leq k \leq N-1$; and since each image is $H \times W$, take $0 \leq i \leq H-1$ and $0 \leq j \leq W-1$. Next, let $\mathrm{out}[k, l]$ denote the $l$-th element of the output vector associated with image $k$, where $0 \leq l \leq M-1$ for some given value of $M$. Lastly, suppose the transformation from the input images to this collection of output vectors is given by the formula,

$$ \mathrm{out}[k, l] = b[l] + \sum_{i=0}^{H-1} \sum_{j=0}^{W-1} \left(x[k, i, j] \times w[l, i, j]\right) $$

where $w[l, i, j]$ are "weights" and $b[l]$ are "biases." The process of "training" the neural network from sample data determines these weight and bias parameters, but for this problem, just assume that they are given.

If it's helpful, here is a picture of what this formula is doing for each $(k, l)$ pair: <img src = "fully_connected.png" width = "600">
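To make the formula concrete, here is a minimal sketch (separate from the graded code cells below, using tiny made-up dimensions and array names) that evaluates $\mathrm{out}[k, l]$ for a single $(k, l)$ pair exactly as the formula reads:

import numpy as np

# Made-up example: N=1 image, H=2 x W=3 pixels, M=1 output
x_small = np.arange(6, dtype=float).reshape(1, 2, 3)  # x[k, i, j] = 0, 1, ..., 5
w_small = np.ones((1, 2, 3))                          # w[l, i, j] = 1 everywhere
b_small = np.array([0.5])                             # b[l]

k, l = 0, 0
out_kl = b_small[l]
for i in range(2):        # i = 0, ..., H-1
    for j in range(3):    # j = 0, ..., W-1
        out_kl += x_small[k, i, j] * w_small[l, i, j]

print(out_kl)  # 0.5 + (0 + 1 + ... + 5) = 15.5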

The baseline implementation. In the code cells below, we define a Python function, FC_naive(x, w, b), that implements the FC layer calculation from above using a straightforward, albeit somewhat naive, method. Your goal is to make this baseline run faster.

To start, first run the next three code cells to estimate the time of the baseline implementation.

In [1]:
import numpy as np
import time
from cse6040bench import benchit

def rel_error(x, y):
    """Returns the maximum relative difference between the entries of x and y."""
    return np.max(np.abs(x - y) / (np.maximum(1e-8, np.abs(x) + np.abs(y))))
In [2]:
def FC_naive(x, w, b):
    """
    Inputs:
    - x: A numpy array of images of shape (N, H, W)
    - w: A numpy array of weights of shape (M, H, W)
    - b: A numpy vector of biases of size M

    Returns: 
    - out: a numpy array of shape (N, M)
    """
    N, H, W = x.shape
    M, _, _ = w.shape
    out = np.zeros((N, M))
    for ni in range(N):
        for mi in range(M):
            out[ni, mi] = b[mi]
            for d1 in range(H):
                for d2 in range(W):
                    out[ni, mi] += x[ni, d1, d2] * w[mi, d1, d2]
    return out
In [3]:
num_inputs = 50
input_shape = (128, 256)
output_dim = 10

x = np.random.rand(num_inputs, *input_shape)
w = np.random.rand(output_dim, *input_shape)
b = np.random.rand(output_dim)
start_time = time.time()
out = FC_naive(x, w, b)
elapsed_time = time.time() - start_time
print("==> Took %g seconds." % elapsed_time)
==> Took 9.1294 seconds.

Exercise 1 (5 points). Let's start by seeing if we can make the FC_naive() function faster by rewriting just its two innermost loops, i.e., the d1 and d2 loops:

for d1 in range(H):
    for d2 in range(W):
        out[ni, mi] += x[ni, d1, d2] * w[mi, d1, d2]

For this exercise, complete the function two_inner_loops(x_i, w_l, b_l), below, so that it implements the same computation as these two d1 and d2 loops but runs much faster. It should return the value of out[ni, mi]. The input x_i is a single input image, w_l is the l-th weight matrix, and b_l is the l-th component of the bias vector.

The test code will check your results and benchmark a complete FC layer using the function FC_two_loops(), defined below. You'll see that it calls your two_inner_loops() routine to implement the two innermost loops.

To get credit on this exercise, FC_two_loops() must run at least 100 times faster than FC_naive() on the problem sizes being tested below when running on the Vocareum platform. There is no partial credit for smaller speedups. Having said that, a little bit of basic Numpy should go a long way.
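Not required for the exercise, but if it helps: the sketch below (with made-up array names and sizes) shows one way to check that a candidate vectorized expression reproduces the two explicit loops before plugging it into two_inner_loops().

import numpy as np

H_demo, W_demo = 4, 5
x_demo = np.random.rand(H_demo, W_demo)   # stand-in for one image
w_demo = np.random.rand(H_demo, W_demo)   # stand-in for one weight matrix
b_demo = 0.25                             # stand-in for one bias value

# Reference: the two explicit loops, exactly as in FC_naive()
ref = b_demo
for d1 in range(H_demo):
    for d2 in range(W_demo):
        ref += x_demo[d1, d2] * w_demo[d1, d2]

# Candidate vectorized form: elementwise product, then a full sum
vec = np.sum(x_demo * w_demo) + b_demo

assert abs(ref - vec) < 1e-12   # should agree up to rounding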

In [4]:
def two_inner_loops(x_i, w_l, b_l):
    """
    Inputs:
    - x_i: A numpy array of images of shape (H, W)
    - w_l: A numpy array of weights of shape (H, W)
    - b_l: A float (single number)

    Returns: 
    - out: A float (single number)
    """
    ### BEGIN SOLUTION
    # Elementwise product of the image and the weight matrix, summed over all
    # H*W entries, plus the bias.
    return np.sum(np.multiply(x_i, w_l)) + b_l
    ### END SOLUTION
In [5]:
# Test cell: 'FC_two_loops_1' (5 points)

def FC_two_loops(x, w, b):
    """
    Inputs:
    - x: A numpy array of images of shape (N, H, W)
    - w: A numpy array of weights of shape (M, H, W)
    - b: A numpy vector of biases of size M

    Returns: 
    - out: a numpy array of shape (N, M)
    """
    N, H, W = x.shape
    M, _, _ = w.shape
    out = np.zeros((N,M))
    for ni in range(N):
        for mi in range(M):
            out[ni, mi] = two_inner_loops(x[ni, :, :], w[mi, :, :], b[mi])
    return out

num_inputs = 50
input_shape = (128, 256)
output_dim = 10

x = np.random.rand(num_inputs, *input_shape)
w = np.random.rand(output_dim, *input_shape)
b = np.random.rand(output_dim)

print("Checking the correctness of your implementation...")
out_fast = FC_two_loops(x, w, b)
out_naive = FC_naive(x, w, b)
error = rel_error(out_naive, out_fast)
print("==> Output error:", error)
assert error < 1e-12, "The value of your output is incorrect or not accurate enough"
print("==> This level of error is acceptable.")

print("\nBenchmarking your code...")
T_fast = benchit("FC_two_loops(x, w, b)", scope=globals())
elapsed_time_fast = T_fast['mean_time_per_run']
print("==> Took %g seconds." % elapsed_time_fast)

print("\nBenchmarking the naive code...")
T_naive = benchit("FC_naive(x, w, b)", scope=globals())
elapsed_time_naive = T_naive['mean_time_per_run']
print("==> Took %g seconds." % elapsed_time_naive)

speed_up = elapsed_time_naive / elapsed_time_fast
print("Speed-up:", speed_up)
assert speed_up >= 100, "The speed-up of your method is less than 100"

print("\n(Passed!)")
Checking the correctness of your implementation...
==> Output error: 9.763613235020374e-15
==> This level of error is acceptable.

Benchmarking your code...
Timing result: (5 trials) x (10 runs) in 1.0306545309140347 secs
==> 0.020613090618280695 secs per run
==> Took 0.0206131 seconds.

Benchmarking the naive code...
Timing result: (5 trials) x (1 runs) in 45.191980345058255 secs
==> 9.03839606901165 secs per run
==> Took 9.0384 seconds.
Speed-up: 438.4784521830006

(Passed!)

Exercise 2 (5 points). Now, completely rewrite the FC_naive() function so that it runs at least 1,200 times faster. Complete your rewrite in the function FC_no_loop(x, w, b), below.

This improvement can be attained with basic Numpy operations that you've learned (i.e., no "new" functions) and no explicit loops.
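To see why no explicit loops are needed, it may help to rewrite the math (this flattened notation is a convenience and not part of the original problem statement). Flatten each image and each weight matrix into a vector of length $H W$: let $X_{\mathrm{flat}}$ be the $N \times (HW)$ matrix whose $k$-th row is the flattened image $k$, and let $W_{\mathrm{flat}}$ be the $M \times (HW)$ matrix whose $l$-th row is the flattened weight matrix $l$. Then the double sum over $(i, j)$ collapses into a single inner product:

$$ \mathrm{out}[k, l] = b[l] + \sum_{p=0}^{HW-1} X_{\mathrm{flat}}[k, p] \times W_{\mathrm{flat}}[l, p], \qquad \text{i.e.,} \qquad \mathrm{out} = X_{\mathrm{flat}} \, W_{\mathrm{flat}}^{T} + b, $$

where the bias vector $b$ is added (broadcast) to every row of the product.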

In [6]:
def FC_no_loop(x, w, b):
    """
    Inputs:
    - x: A numpy array of images of shape (N, H, W)
    - w: A numpy array of weights of shape (M, H, W)
    - b: A numpy vector of biases of size M

    Returns: 
    - out: a numpy array of shape (N, M)
    """
    N, H, W = x.shape
    M, _, _ = w.shape
    out = np.zeros((N, M))
    ### BEGIN SOLUTION
    # Flatten each image and each weight matrix into a length-(H*W) vector, so the
    # double sum over (i, j) becomes a single matrix multiplication.
    x = np.reshape(x, (N, H*W))
    w = np.reshape(w, (M, H*W))
    out = x.dot(w.T) + b

    # The following one-line solution, which Googling might turn up, also works,
    # but it is unlikely to be fast enough in its current implementation:
    # out = np.einsum("nhw,mhw->nm", x, w) + b
    ### END SOLUTION
    return out
In [7]:
# Test cell: 'FC_no_loop' (5 points)
num_inputs = 50
input_shape = (128, 256)
output_dim = 10

x = np.random.rand(num_inputs, *input_shape)
w = np.random.rand(output_dim, *input_shape)
b = np.random.rand(output_dim)

print("Checking the correctness of your implementation...")
out_fast = FC_no_loop(x, w, b)
out_naive = FC_naive(x, w, b)
error = rel_error(out_naive, out_fast)
print("==> Output error:", error)
assert error < 1e-12, "The value of your output is incorrect or not accurate enough"
print("==> This level of error is acceptable.")

print("\nBenchmarking your code...")
T_fast = benchit("FC_no_loop(x, w, b)", scope=globals())
elapsed_time_fast = T_fast['mean_time_per_run']
print("==> Took %g seconds." % elapsed_time_fast)

print("\nBenchmarking the naive code...")
T_naive = benchit("FC_naive(x, w, b)", scope=globals())
elapsed_time_naive = T_naive['mean_time_per_run']
print("==> Took %g seconds." % elapsed_time_naive)

speed_up = elapsed_time_naive / elapsed_time_fast
print("Speed-up:", speed_up)
assert speed_up >= 1200, "The speed-up of your method is less than 1200"

print("\n(Passed!)")
Checking the correctness of your implementation...
==> Output error: 6.5240027695320964e-15
==> This level of error is acceptable.

Benchmarking your code...
Timing result: (5 trials) x (100 runs) in 2.973538712016307 secs
==> 0.005947077424032613 secs per run
==> Took 0.00594708 seconds.

Benchmarking the naive code...
Timing result: (5 trials) x (1 runs) in 45.34813046990894 secs
==> 9.069626093981787 secs per run
==> Took 9.06963 seconds.
Speed-up: 1525.0559976452814

(Passed!)

Fin! You've reached the end of this problem. Don't forget to restart the kernel and run the entire notebook from top-to-bottom to make sure you did everything correctly. If that is working, try submitting this problem. (Recall that you must submit and pass the autograder to get credit for your work!)