# Stochastic Gradient Descent

## Problem Description

This benchmark implements Stochastic Gradient Descent (SGD). In particular, it uses SGD to complete the unknown entries of a sparse ratings matrix *R* (*mxu*), whose known entries form the edges of a bipartite graph.
The goal of the benchmark is to build two matrices, *A* (*1xm*) and *B* (*1xu*), such that the matrix product *AB* approximately equals *R*.
Each entry of *A* and *B* is a vector of real numbers, called a *latent vector*.

After *A* and *B* have been calculated, they can be used to predict unknown entries of *R*.
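For instance, a missing entry of *R* is predicted as the dot product of the two corresponding latent vectors. A minimal sketch (the latent dimension and the values of *A* and *B* below are illustrative, not taken from the benchmark):

```python
# A[i] holds the latent vector of item i, B[j] that of user j.
# Illustrative trained factors with m = 2, u = 2, latent dimension k = 2.
A = [[0.5, 1.0], [1.5, 0.2]]
B = [[2.0, 0.5], [0.4, 1.0]]

def predict(i, j):
    """Predicted rating R[i][j] ~= dot(A[i], B[j])."""
    return sum(a * b for a, b in zip(A[i], B[j]))

print(predict(0, 0))  # 0.5*2.0 + 1.0*0.5 = 1.5
```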

## Algorithm

First, the latent vectors are initialized to random values, and the ratings are normalized.
Then, for every edge in the graph *R*, stochastic gradient descent is applied as follows:

```
function gradient_update(edge) {
    Mlv  = edge.src.lv
    Ulv  = edge.dst.lv
    diff = edge.rating - dotproduct(Mlv, Ulv)
    update(Mlv, diff)
    update(Ulv, diff)
}
```
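The `update` step moves both latent vectors along the gradient of the squared prediction error. A concrete sketch in Python, assuming a plain SGD step with L2 regularization; the learning rate `lr` and regularization weight `lam` are illustrative and not the benchmark's actual hyperparameters:

```python
lr, lam = 0.05, 0.01  # illustrative step size and L2 weight

def dot(x, y):
    return sum(a * b for a, b in zip(x, y))

def gradient_update(Mlv, Ulv, rating):
    """One SGD step on a single edge; mutates both latent vectors in place."""
    diff = rating - dot(Mlv, Ulv)
    for k in range(len(Mlv)):
        m, u = Mlv[k], Ulv[k]
        Mlv[k] += lr * (diff * u - lam * m)
        Ulv[k] += lr * (diff * m - lam * u)
    return diff

# Repeated updates on the same edge shrink the prediction error:
Mlv, Ulv = [0.1, 0.1], [0.1, 0.1]
errs = [abs(gradient_update(Mlv, Ulv, 1.0)) for _ in range(100)]
print(errs[0] > errs[-1])  # True: the error decreases over iterations
```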

The key problem in a parallel implementation is to ensure that the `gradient_update` procedure has exclusive access to the end-points of the edge.
Although this is conceptually simple, the benchmarks show that different locking strategies lead to different schedules, which in turn lead to very wide differences in performance.
This is especially the case when applied to scale-free graphs.
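One simple locking strategy (a CPU-side sketch for illustration, not the benchmark's actual GPU scheme) is to give every vertex a lock and always acquire the two end-point locks in a fixed global order, which prevents deadlock when many edges are processed concurrently:

```python
import threading

# Hypothetical per-vertex locks; vertex ids impose a global acquisition order.
num_vertices = 4
locks = [threading.Lock() for _ in range(num_vertices)]
counts = [0] * num_vertices  # stand-in for the latent-vector updates

def locked_update(src, dst):
    # Acquiring both endpoint locks in ascending id order avoids deadlock.
    first, second = sorted((src, dst))
    with locks[first], locks[second]:
        counts[src] += 1
        counts[dst] += 1

edges = [(0, 2), (1, 2), (0, 3)] * 100
threads = [threading.Thread(target=locked_update, args=e) for e in edges]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(counts)  # each vertex is updated once per incident edge
```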

For more information on this benchmark, see:

Rashid Kaleem, Anand Venkat, Sreepathi Pai, Mary Hall, Keshav Pingali, "Synchronization Trade-offs in GPU Implementations of Graph Algorithms", IPDPS, 2016.

The benchmark is set to run for 5 rounds, and it corresponds roughly to the GPU implementations described in the paper above.

## Performance

Graph | Time (ms)
---|---
Netflix | 5000

The benchmark is written in CUDA.

Compile as:

`make`

Execute as:

`blk_sgd <bipartite-graph-input>`

e.g.,

`blk_sgd bgg.gr`

The performance above was measured on a GeForce GTX 1080 running at 1733.500 MHz with 8119 MB of device memory.