a lightweight neural network library for sparse data
With the deluge of deep learning libraries and software innovation in the field over the last few years, it is an exciting time to be working on machine learning problems. Most of the libraries available evolved from fairly specialized computational code for large dense problems such as image classification into general frameworks for neural-network-based models offering marginal support for sparse models.
At Netflix, our machine learning scientists deal with a wide variety of problems across a broad spectrum of areas: from tailoring TV and movie recommendations to your taste to optimizing encoding algorithms. A subset of our problems involve dealing with extremely sparse data; the total dimensionality of the problem at hand can easily reach tens of millions of features, even though every observation may only have a handful of non-zero entries. For these cases, we felt the need for a minimalist library that is specifically optimized for training shallow feedforward neural nets on sparse data in a single-machine, multi-core environment. We wanted something small and easy to hack, so we built Vectorflow, one of the many tools our machine learning scientists use.
- Agility. We want our data scientists to easily run and iterate on their models in total autonomy. So we wrote Vectorflow in D, a modern systems language with a very gentle learning curve. Thanks to its fast compilers and functional programming features, it offers a Python-like experience for beginners but with typically multiple orders of magnitude of performance gain at run-time, while enabling seasoned developers to leverage an excellent templating engine, compile-time functionalities and lower-level features (C interface, inline assembler, manual memory management, auto-vectorization, …). Vectorflow does not have any third-party dependencies, which eases its deployment. It offers a callback-based API to easily plug-in custom loss functions for training.
- Sparse-aware. Designing the library for sparse data and shallow architectures implies that the runtime bottleneck will tend to be IO: there will be relatively few operations to run per row, contrary, for example, to a convolutional layer on a large dense matrix. Vectorflow avoids, wherever possible, copying or allocating any memory during both the forward and backward passes, with each layer referencing the data it needs from its parents and children. Matrix-vector operations have both sparse and dense implementations, the latter ones being SIMD-vectorized. Vectorflow also offers a way to run a sparse backpropagation when dealing with sparse output gradients.
- IO agnostic. If you are IO bound, by definition the trainer will run only as fast as your IO layer. Vectorflow enforces very loose requirements on the underlying data schema (merely to provide an iterator of rows with a “features” attribute) so that one can write efficient data adapters based on the data source and avoid any pre-processing or data conversion steps while sticking with the same programming language. This allows you to move the code to the data, not the opposite.
- Single-machine. Distributed systems are hard to debug and introduce fixed costs such as job scheduling. Implementing distributed optimization of a novel machine learning technique is even harder. This is why we created an efficient solution in a single machine setting, lowering iteration time of modeling without sacrificing scalability for small to medium scale problems (100 million rows). We opted for generic asynchronous SGD solvers using Hogwild as a lock-free strategy to distribute the load over the cores with no communication cost. This works for most linear or shallow net models as long as the data is sufficiently sparse, and avoids having to think about the distributed aspect of the algorithm, since everything works as in a non-distributed case from a user perspective.
A few months after the project’s inception, we’ve seen a wide variety of use cases for the library and multiple research projects and production systems are now using Vectorflow for problems as diverse as causal inference, survival regression, density estimation or ranking algorithms for recommendation. In fact, we’re testing using Vectorflow to power part of the Netflix home page experience. It is also included in the default toolbox installed on basic instances used by Netflix machine learning practitioners.
As an example, we investigate the performance of the library on a marketing problem Netflix faces related to promoting our originals. In this case, we want to perform weighted Maximum Likelihood Estimation with a survival exponential distribution. To implement this, the custom callback function passed to Vectorflow is:
Using this callback for training, we can easily compare 3 models:
- Model 1: linear model on a tiny set of sparse features (~500 parameters to learn)
- Model 2: linear model on a larger sparse set of features (1M parameters to learn)
- Model 3: shallow neural network on a sparse set of features (10M parameters to learn), trained on twice the data
The data source is a Hive table stored on S3 using the columnar data format Parquet and we train directly against this data by streaming it to a c4.4xlarge instance and building in-memory the training set which we learn from.
The results are as follows:
Both decompression and feature encoding happen on a single thread so there is room for improvement, but the end-to-end runtime demonstrates that there is no need for a distributed solution for medium-scale sparse datasets and shallow architectures. Notice that the training time scales somewhat linearly with the sparsity of the data as well as the number of rows. One reason preventing linear scalability is that CPU memory hierarchy will create cache invalidation when multiple asynchronous SGD threads access the same weights, hence breaking the theoretical results of Hogwild if the model parameters access pattern is not sparse enough (see this paper for details).
In the future, we plan to broaden the possible topologies supported beyond simple linear, polynomial or feedforward architectures, develop more specialized layers (such as recurrent cells) and explore new parallelism strategies while maintaining the “minimalist” philosophy of Vectorflow.