Most of our experiments were carried out on machines with 8 cores. Section 4.1 incorrectly states that the machines have 4 cores; they in fact have 8. With 8 cores, the CPU version is as fast as the GPU version and faster than conventional single-core versions. This is one reason why Figure 4 shows very little improvement when switching from CPU to GPU. Our implementation also simulates the case where the whole dataset cannot be stored in GPU memory (usually less than 1 GB): data is sent from the CPU to the GPU during optimization, which further slows down the GPU implementation. For these two reasons, the GPU implementation performs worse than is conventionally reported in the literature. A more careful implementation that makes better use of the GPU should improve all of SGD-GPU, CG-GPU and L-BFGS-GPU. More details are given in the supplementary document.
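
As a rough illustration of how the two effects combine, the sketch below uses a deliberately simplified timing model. The function name `gpu_speedup`, the assumption of ideal linear scaling over CPU cores, the assumption that host-to-device transfers are not overlapped with computation, and all of the numbers are hypothetical. None of them are measurements from our experiments.

```python
def gpu_speedup(cpu_1core_s, n_cores, gpu_compute_s, transfer_s):
    """Speedup of the GPU over an n-core CPU for one optimization step.

    Simplified, hypothetical model:
      - CPU time scales ideally (linearly) with the number of cores.
      - Host-to-device transfer is not overlapped with GPU computation.
    """
    cpu_time = cpu_1core_s / n_cores
    gpu_time = gpu_compute_s + transfer_s
    return cpu_time / gpu_time


# Hypothetical per-step timings in seconds (illustrative only).
CPU_1CORE_S = 8.0
GPU_COMPUTE_S = 0.5
TRANSFER_S = 0.5

# GPU vs. a single-core CPU, data already resident on the GPU.
print(gpu_speedup(CPU_1CORE_S, 1, GPU_COMPUTE_S, 0.0))         # 16.0
# GPU vs. an 8-core CPU, data already resident on the GPU.
print(gpu_speedup(CPU_1CORE_S, 8, GPU_COMPUTE_S, 0.0))         # 2.0
# GPU vs. an 8-core CPU, data streamed to the GPU at every step.
print(gpu_speedup(CPU_1CORE_S, 8, GPU_COMPUTE_S, TRANSFER_S))  # 1.0
```

Under these assumed numbers, a GPU advantage that looks large against a single core shrinks against 8 cores and disappears once the transfer cost is included. This is the qualitative behaviour described above, not a reproduction of Figure 4.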