Commit 395d2ce6 authored by huchen

init the faiss for rocm

parent 5ded39f5
# Changelog
All notable changes to this project will be documented in this file.
The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/),
and this project DOES NOT adhere to [Semantic Versioning](https://semver.org/spec/v2.0.0.html)
at the moment.
We try to credit contributors who are not part of the Facebook Faiss team by
name. Feel free to add an entry here if you submit a PR.
## [Unreleased]
## [1.7.2] - 2021-12-15
### Added
- Support LSQ on GPU (by @KinglittleQ)
- Support for exact 1D kmeans (by @KinglittleQ)
## [1.7.1] - 2021-05-27
### Added
- Support for building C bindings through the `FAISS_ENABLE_C_API` CMake option.
- Serializing the indexes with the python pickle module
- Support for the NNDescent k-NN graph building method (by @KinglittleQ)
- Support for the NSG graph indexing method (by @KinglittleQ)
- Residual quantizers: support as codec and unoptimized search
- Support for 4-bit PQ implementation for ARM (by @vorj, @n-miyamoto-fixstars, @LWisteria, and @matsui528)
- Implementation of Local Search Quantization (by @KinglittleQ)
### Changed
- The order of `xb` and `xq` was different between `faiss.knn` and `faiss.knn_gpu`.
Also, the metric argument was called `distance_type`.
- The typed vectors (LongVector, LongLongVector, etc.) of the SWIG interface have
been deprecated. They have been replaced with Int32Vector, Int64Vector, etc. (by h-vetinari)
### Fixed
- Fixed a bug causing kNN search functions for IndexBinaryHash and
IndexBinaryMultiHash to return results in a random order.
- Copy constructor of AlignedTable had a bug leading to crashes when cloning
IVFPQ indices.
## [1.7.0] - 2021-01-27
## [1.6.5] - 2020-11-22
## [1.6.4] - 2020-10-12
### Added
- Arbitrary dimensions per sub-quantizer now allowed for `GpuIndexIVFPQ`.
- Brute-force kNN on GPU (`bfKnn`) now accepts `int32` indices.
- Nightly conda builds now available (for CPU).
- Faiss is now supported on Windows.
## [1.6.3] - 2020-03-24
### Added
- Support alternative distances on GPU for GpuIndexFlat, including L1, Linf and
Lp metrics.
- Support METRIC_INNER_PRODUCT for GpuIndexIVFPQ.
- Support float16 coarse quantizer for GpuIndexIVFFlat and GpuIndexIVFPQ. GPU
Tensor Core operations (mixed-precision arithmetic) are enabled on supported
hardware when operating with float16 data.
- Support k-means clustering with encoded vectors. This makes it possible to
train on larger datasets without decompressing them in RAM, and is especially
useful for binary datasets (see https://github.com/facebookresearch/faiss/blob/main/tests/test_build_blocks.py#L92).
- Support weighted k-means. Weights can be associated to each training point
(see https://github.com/facebookresearch/faiss/blob/main/tests/test_build_blocks.py).
- Serialize callback in python, to write to pipes or sockets (see
https://github.com/facebookresearch/faiss/wiki/Index-IO,-cloning-and-hyper-parameter-tuning).
- Reconstruct arbitrary ids from IndexIVF + efficient remove of a small number
of ids. This avoids 2 inefficiencies: O(ntotal) removal of vectors and
IndexIDMap2 on top of indexIVF. Documentation here:
https://github.com/facebookresearch/faiss/wiki/Special-operations-on-indexes.
- Support inner product as a metric in IndexHNSW (see
https://github.com/facebookresearch/faiss/blob/main/tests/test_index.py#L490).
- Support PQ of sizes other than 8 bit in IndexIVFPQ.
- Demo on how to perform searches sequentially on an IVF index. This is useful
for an OnDisk index with a very large batch of queries. In that case, it is
worthwhile to scan the index sequentially (see
https://github.com/facebookresearch/faiss/blob/main/tests/test_ivflib.py#L62).
- Range search support for most binary indexes.
- Support for hashing-based binary indexes (see
https://github.com/facebookresearch/faiss/wiki/Binary-indexes).
### Changed
- Replaced obj table in Clustering object: now it is a ClusteringIterationStats
structure that contains additional statistics.
### Removed
- Removed support for useFloat16Accumulator for accumulators on GPU (all
accumulations are now done in float32, regardless of whether float16 or float32
input data is used).
### Fixed
- Some python3 fixes in benchmarks.
- Fixed GpuCloner (some fields were not copied, default to no precomputed tables
with IndexIVFPQ).
- Fixed support for new pytorch versions.
- Serialization bug with alternative distances.
- Removed test on multiple-of-4 dimensions when switching between blas and AVX
implementations.
## [1.6.2] - 2020-03-10
## [1.6.1] - 2019-12-04
## [1.6.0] - 2019-09-24
### Added
- Faiss as a codec: We introduce a new API within Faiss to encode fixed-size
vectors into fixed-size codes. The encoding is lossy and the tradeoff between
compression and reconstruction accuracy can be adjusted.
- ScalarQuantizer support for GPU, see gpu/GpuIndexIVFScalarQuantizer.h. This is
particularly useful as GPU memory is often less abundant than CPU memory.
- Added easy-to-use serialization functions for indexes to byte arrays in Python
(faiss.serialize_index, faiss.deserialize_index).
- The Python KMeans object can now use the GPU directly: just add `gpu=True` to the
constructor (see gpu/test/test_gpu_index.py, test TestGPUKmeans).
### Changed
- Change in the code layout: many C++ sources are now in subdirectories impl/
and utils/.
## [1.5.3] - 2019-06-24
### Added
- Basic support for 6 new metrics in CPU IndexFlat and IndexHNSW (https://github.com/facebookresearch/faiss/issues/848).
- Support for IndexIDMap/IndexIDMap2 with binary indexes (https://github.com/facebookresearch/faiss/issues/780).
### Changed
- Throw python exception for OOM (https://github.com/facebookresearch/faiss/issues/758).
- Make DistanceComputer available for all random access indexes.
- Gradually moving from long to uint64_t for portability.
### Fixed
- Slow scanning of inverted lists (https://github.com/facebookresearch/faiss/issues/836).
## [1.5.2] - 2019-05-28
### Added
- Support for searching several inverted lists in parallel (parallel_mode != 0).
- Better support for PQ codes where nbit != 8 or 16.
- IVFSpectralHash implementation: spectral hash codes inside an IVF.
- 6-bit per component scalar quantizer (4 and 8 bit were already supported).
- Combinations of inverted lists: HStackInvertedLists and VStackInvertedLists.
- Configurable number of threads for OnDiskInvertedLists prefetching (including
0=no prefetch).
- More test and demo code compatible with Python 3 (print with parentheses).
### Changed
- License was changed from BSD+Patents to MIT.
- Exceptions raised in sub-indexes of IndexShards and IndexReplicas are now
propagated.
- Refactored benchmark code: data loading is now in a single file.
## [1.5.1] - 2019-04-05
### Added
- MatrixStats object, which reports useful statistics about a dataset.
- Option to round coordinates during k-means optimization.
- An alternative option for search in HNSW.
- Support for range search in IVFScalarQuantizer.
- Support for direct uint_8 codec in ScalarQuantizer.
- Better support for PQ code assignment with external index.
- Support for IMI2x16 (4B virtual centroids).
- Support for k = 2048 search on GPU (instead of 1024).
- Support for renaming an ondisk invertedlists.
- Support for interrupting computations with an interrupt signal (ctrl-C) in python.
- Simplified build system (with --with-cuda/--with-cuda-arch options).
### Changed
- Moved stats() and imbalance_factor() from IndexIVF to InvertedLists object.
- Renamed IndexProxy to IndexReplicas.
- Most CUDA mem alloc failures now throw exceptions instead of terminating on an
assertion.
- Updated example Dockerfile.
- Conda packages now depend on the cudatoolkit packages, which fixes some
interference with pytorch. Consequently, faiss-gpu should now be installed
with `conda install -c pytorch faiss-gpu cudatoolkit=10.0`.
## [1.5.0] - 2018-12-19
### Added
- New GpuIndexBinaryFlat index.
- New IndexBinaryHNSW index.
## [1.4.0] - 2018-08-30
### Added
- Automatic tracking of C++ references in Python.
- Support for non-intel platforms, some functions optimized for ARM.
- Support for overriding nprobe for concurrent searches.
- Support for floating-point quantizers in binary indices.
### Fixed
- No more segfaults due to Python's GC.
- GpuIndexIVFFlat issues for float32 with 64 / 128 dims.
- Sharding of flat indexes on GPU with index_cpu_to_gpu_multiple.
## [1.3.0] - 2018-07-10
### Added
- Support for binary indexes (IndexBinaryFlat, IndexBinaryIVF).
- Support fp16 encoding in scalar quantizer.
- Support for deduplication in IndexIVFFlat.
- Support for index serialization.
### Fixed
- MMAP bug for normal indices.
- Propagation of io_flags in read func.
- k-selection for CUDA 9.
- Race condition in OnDiskInvertedLists.
## [1.2.1] - 2018-02-28
### Added
- Support for on-disk storage of IndexIVF data.
- C bindings.
- Extended tutorial to GPU indices.
[Unreleased]: https://github.com/facebookresearch/faiss/compare/v1.7.2...HEAD
[1.7.2]: https://github.com/facebookresearch/faiss/compare/v1.7.1...v1.7.2
[1.7.1]: https://github.com/facebookresearch/faiss/compare/v1.7.0...v1.7.1
[1.7.0]: https://github.com/facebookresearch/faiss/compare/v1.6.5...v1.7.0
[1.6.5]: https://github.com/facebookresearch/faiss/compare/v1.6.4...v1.6.5
[1.6.4]: https://github.com/facebookresearch/faiss/compare/v1.6.3...v1.6.4
[1.6.3]: https://github.com/facebookresearch/faiss/compare/v1.6.2...v1.6.3
[1.6.2]: https://github.com/facebookresearch/faiss/compare/v1.6.1...v1.6.2
[1.6.1]: https://github.com/facebookresearch/faiss/compare/v1.6.0...v1.6.1
[1.6.0]: https://github.com/facebookresearch/faiss/compare/v1.5.3...v1.6.0
[1.5.3]: https://github.com/facebookresearch/faiss/compare/v1.5.2...v1.5.3
[1.5.2]: https://github.com/facebookresearch/faiss/compare/v1.5.1...v1.5.2
[1.5.1]: https://github.com/facebookresearch/faiss/compare/v1.5.0...v1.5.1
[1.5.0]: https://github.com/facebookresearch/faiss/compare/v1.4.0...v1.5.0
[1.4.0]: https://github.com/facebookresearch/faiss/compare/v1.3.0...v1.4.0
[1.3.0]: https://github.com/facebookresearch/faiss/compare/v1.2.1...v1.3.0
[1.2.1]: https://github.com/facebookresearch/faiss/releases/tag/v1.2.1
# Copyright (c) Facebook, Inc. and its affiliates.
# All rights reserved.
#
# This source code is licensed under the BSD-style license found in the
# LICENSE file in the root directory of this source tree.
cmake_minimum_required(VERSION 3.17 FATAL_ERROR)
project(faiss
VERSION 1.6.4
DESCRIPTION "A library for efficient similarity search and clustering of dense vectors."
HOMEPAGE_URL "https://github.com/facebookresearch/faiss"
LANGUAGES CXX)
include(GNUInstallDirs)
set(CMAKE_CXX_STANDARD 11)
list(APPEND CMAKE_MODULE_PATH "${PROJECT_SOURCE_DIR}/cmake")
# Valid values are "generic", "avx2".
option(FAISS_OPT_LEVEL "" "generic")
option(FAISS_ENABLE_GPU "Enable support for GPU indexes." ON)
option(FAISS_ENABLE_PYTHON "Build Python extension." ON)
option(FAISS_ENABLE_C_API "Build C API." OFF)
# HC
if(FAISS_ENABLE_GPU)
set(CMAKE_CUDA_HOST_COMPILER ${CMAKE_CXX_COMPILER})
find_package(HIP)
#enable_language(CUDA)
endif()
add_subdirectory(faiss)
if(FAISS_ENABLE_GPU)
add_subdirectory(faiss/gpu)
endif()
if(FAISS_ENABLE_PYTHON)
add_subdirectory(faiss/python)
endif()
if(FAISS_ENABLE_C_API)
add_subdirectory(c_api)
endif()
add_subdirectory(demos)
add_subdirectory(tutorial/cpp)
# CTest must be included in the top level to enable `make test` target.
include(CTest)
if(BUILD_TESTING)
add_subdirectory(tests)
if(FAISS_ENABLE_GPU)
add_subdirectory(faiss/gpu/test)
endif()
endif()
# Code of Conduct
Facebook has adopted a Code of Conduct that we expect project participants to adhere to. Please [read the full text](https://code.fb.com/codeofconduct) so that you can understand what actions will and will not be tolerated.
# Contributing to Faiss
We want to make contributing to this project as easy and transparent as
possible.
## Our Development Process
We mainly develop Faiss within Facebook. Sometimes, we will sync the
github version of Faiss with the internal state.
## Pull Requests
We welcome pull requests that add significant value to Faiss. If you plan to do
a major development and contribute it back to Faiss, please contact us first before
putting too much effort into it.
1. Fork the repo and create your branch from `main`.
2. If you've added code that should be tested, add tests.
3. If you've changed APIs, update the documentation.
4. Ensure the test suite passes.
5. Make sure your code lints.
6. If you haven't already, complete the Contributor License Agreement ("CLA").
There is a Facebook internal test suite for Faiss, and we need to run
all changes to Faiss through it.
## Contributor License Agreement ("CLA")
In order to accept your pull request, we need you to submit a CLA. You only need
to do this once to work on any of Facebook's open source projects.
Complete your CLA here: <https://code.facebook.com/cla>
## Issues
We use GitHub issues to track public bugs. Please ensure your description is
clear and has sufficient instructions to be able to reproduce the issue.
Facebook has a [bounty program](https://www.facebook.com/whitehat/) for the safe
disclosure of security bugs. In those cases, please go through the process
outlined on that page and do not file a public issue.
## Coding Style
* 4 or 2 spaces for indentation in C++ (no tabs)
* 80 character line length (both for C++ and Python)
* C++ language level: C++11
## License
By contributing to Faiss, you agree that your contributions will be licensed
under the LICENSE file in the root directory of this source tree.
FROM nvidia/cuda:8.0-devel-centos7
# Install MKL
RUN yum-config-manager --add-repo https://yum.repos.intel.com/mkl/setup/intel-mkl.repo
RUN rpm --import https://yum.repos.intel.com/intel-gpg-keys/GPG-PUB-KEY-INTEL-SW-PRODUCTS-2019.PUB
RUN yum install -y intel-mkl-2019.3-062
ENV LD_LIBRARY_PATH /opt/intel/mkl/lib/intel64:$LD_LIBRARY_PATH
ENV LIBRARY_PATH /opt/intel/mkl/lib/intel64:$LIBRARY_PATH
ENV LD_PRELOAD /usr/lib64/libgomp.so.1:/opt/intel/mkl/lib/intel64/libmkl_def.so:\
/opt/intel/mkl/lib/intel64/libmkl_avx2.so:/opt/intel/mkl/lib/intel64/libmkl_core.so:\
/opt/intel/mkl/lib/intel64/libmkl_intel_lp64.so:/opt/intel/mkl/lib/intel64/libmkl_gnu_thread.so
# Install necessary build tools
RUN yum install -y gcc-c++ make swig3
# Install necessary headers/libs
RUN yum install -y python-devel numpy
COPY . /opt/faiss
WORKDIR /opt/faiss
# --with-cuda=/usr/local/cuda-8.0
RUN ./configure --prefix=/usr --libdir=/usr/lib64 --without-cuda
RUN make -j $(nproc)
RUN make -C python
RUN make test
RUN make install
RUN make -C demos demo_ivfpq_indexing && ./demos/demo_ivfpq_indexing
# Installing Faiss via conda
The recommended way to install Faiss is through [conda](https://docs.conda.io).
Stable releases are pushed regularly to the pytorch conda channel, as well as
pre-release nightly builds.
The CPU-only `faiss-cpu` conda package is currently available on Linux, OSX, and
Windows. The `faiss-gpu` package, containing both CPU and GPU indices, is
available on Linux systems, for various versions of CUDA.
To install the latest stable release:
``` shell
# CPU-only version
$ conda install -c pytorch faiss-cpu
# GPU(+CPU) version
$ conda install -c pytorch faiss-gpu
# or for a specific CUDA version
$ conda install -c pytorch faiss-gpu cudatoolkit=10.2 # for CUDA 10.2
```
Nightly pre-release packages can be installed as follows:
``` shell
# CPU-only version
$ conda install -c pytorch/label/nightly faiss-cpu
# GPU(+CPU) version
$ conda install -c pytorch/label/nightly faiss-gpu
```
## Installing from conda-forge
Faiss is also being packaged by [conda-forge](https://conda-forge.org/), the
community-driven packaging ecosystem for conda. The packaging effort is
collaborating with the Faiss team to ensure high-quality package builds.
Thanks to conda-forge's comprehensive build infrastructure, certain build
combinations may even be available there that are not provided through the
pytorch channel. To install, use
``` shell
# CPU version
$ conda install -c conda-forge faiss-cpu
# GPU version
$ conda install -c conda-forge faiss-gpu
```
You can tell which channel your conda packages come from by using `conda list`.
If you are having problems using a package built by conda-forge, please raise
an [issue](https://github.com/conda-forge/faiss-split-feedstock/issues) on the
conda-forge package "feedstock".
# Building from source
Faiss can be built from source using CMake.
Faiss is supported on x86_64 machines on Linux, OSX, and Windows. It has been
found to run on other platforms as well, see
[other platforms](https://github.com/facebookresearch/faiss/wiki/Related-projects#bindings-to-other-languages-and-porting-to-other-platforms).
The basic requirements are:
- a C++11 compiler (with support for OpenMP version 2 or higher),
- a BLAS implementation (we strongly recommend using Intel MKL for best
performance).
The optional requirements are:
- for GPU indices:
- nvcc,
- the CUDA toolkit,
- for the python bindings:
- python 3,
- numpy,
- and swig.
Indications for specific configurations are available in the [troubleshooting
section of the wiki](https://github.com/facebookresearch/faiss/wiki/Troubleshooting).
## Step 1: invoking CMake
``` shell
$ cmake -B build .
```
This generates the system-dependent configuration/build files in the `build/`
subdirectory.
Several options can be passed to CMake, among which:
- general options:
- `-DFAISS_ENABLE_GPU=OFF` in order to disable building GPU indices (possible
values are `ON` and `OFF`),
- `-DFAISS_ENABLE_PYTHON=OFF` in order to disable building python bindings
(possible values are `ON` and `OFF`),
- `-DBUILD_TESTING=OFF` in order to disable building C++ tests,
- `-DBUILD_SHARED_LIBS=ON` in order to build a shared library (possible values
are `ON` and `OFF`),
- optimization-related options:
- `-DCMAKE_BUILD_TYPE=Release` in order to enable generic compiler
optimization options (enables `-O3` on gcc for instance),
- `-DFAISS_OPT_LEVEL=avx2` in order to enable the required compiler flags to
generate code using optimized SIMD instructions (possible values are `generic`,
`sse4`, and `avx2`, by increasing order of optimization),
- BLAS-related options:
- `-DBLA_VENDOR=Intel10_64_dyn -DMKL_LIBRARIES=/path/to/mkl/libs` to use the
Intel MKL BLAS implementation, which is significantly faster than OpenBLAS
(more information about the values for the `BLA_VENDOR` option can be found in
the [CMake docs](https://cmake.org/cmake/help/latest/module/FindBLAS.html)),
- GPU-related options:
- `-DCUDAToolkit_ROOT=/path/to/cuda-10.1` in order to hint to the path of
the CUDA toolkit (for more information, see
[CMake docs](https://cmake.org/cmake/help/latest/module/FindCUDAToolkit.html)),
- `-DCMAKE_CUDA_ARCHITECTURES="75;72"` for specifying which GPU architectures
to build against (see [CUDA docs](https://developer.nvidia.com/cuda-gpus) to
determine which architecture(s) you should pick),
- python-related options:
- `-DPython_EXECUTABLE=/path/to/python3.7` in order to build a python
interface for a different python than the default one (see
[CMake docs](https://cmake.org/cmake/help/latest/module/FindPython.html)).
## Step 2: Invoking Make
``` shell
$ make -C build -j faiss
```
This builds the C++ library (`libfaiss.a` by default, and `libfaiss.so` if
`-DBUILD_SHARED_LIBS=ON` was passed to CMake).
The `-j` option enables parallel compilation of multiple units, leading to a
faster build, but increasing the chances of running out of memory, in which case
it is recommended to set the `-j` option to a fixed value (such as `-j4`).
## Step 3: Building the python bindings (optional)
``` shell
$ make -C build -j swigfaiss
$ (cd build/faiss/python && python setup.py install)
```
The first command builds the python bindings for Faiss, while the second one
generates and installs the python package.
## Step 4: Installing the C++ library and headers (optional)
``` shell
$ make -C build install
```
This will make the compiled library (either `libfaiss.a` or `libfaiss.so` on
Linux) available system-wide, as well as the C++ headers. This step is not
needed to install the python package only.
## Step 5: Testing (optional)
### Running the C++ test suite
To run the whole test suite, make sure that `cmake` was invoked with
`-DBUILD_TESTING=ON`, and run:
``` shell
$ make -C build test
```
### Running the python test suite
``` shell
$ (cd build/faiss/python && python setup.py build)
$ PYTHONPATH="$(ls -d ./build/faiss/python/build/lib*/)" pytest tests/test_*.py
```
### Basic example
A basic usage example is available in
[`demos/demo_ivfpq_indexing.cpp`](https://github.com/facebookresearch/faiss/blob/main/demos/demo_ivfpq_indexing.cpp).
It creates a small index, stores it and performs some searches. A normal runtime
is around 20s. With a fast machine and Intel MKL's BLAS it runs in 2.5s.
It can be built with
``` shell
$ make -C build demo_ivfpq_indexing
```
and subsequently run with
``` shell
$ ./build/demos/demo_ivfpq_indexing
```
### Basic GPU example
``` shell
$ make -C build demo_ivfpq_indexing_gpu
$ ./build/demos/demo_ivfpq_indexing_gpu
```
This produces the GPU equivalent of the CPU `demo_ivfpq_indexing`. It also
shows how to translate indexes from/to a GPU.
### A real-life benchmark
A longer example runs and evaluates Faiss on the SIFT1M dataset. To run it,
please download the ANN_SIFT1M dataset from http://corpus-texmex.irisa.fr/
and unzip it to the subdirectory `sift1M` at the root of the source
directory for this repository.
Then compile and run the following (after ensuring you have installed faiss):
``` shell
$ make -C build demo_sift1M
$ ./build/demos/demo_sift1M
```
This is a demonstration of the high-level auto-tuning API. You can try
setting a different index_key to find the indexing structure that
gives the best performance.
### Real-life test
The following script extends the demo_sift1M test to several types of
indexes. This must be run from the root of the source directory for this
repository:
``` shell
$ mkdir tmp # graphs of the output will be written here
$ python demos/demo_auto_tune.py
```
It will cycle through a few types of indexes and find optimal
operating points. You can play around with the types of indexes.
### Real-life test on GPU
The example above also runs on GPU. Edit `demos/demo_auto_tune.py` at line 100
with the values
``` python
keys_to_test = keys_gpu
use_gpu = True
```
and you can run
``` shell
$ python demos/demo_auto_tune.py
```
to test the GPU code.
MIT License
Copyright (c) Facebook, Inc. and its affiliates.
Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:
The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.
# Faiss
Faiss is a library for efficient similarity search and clustering of dense vectors. It contains algorithms that search in sets of vectors of any size, up to ones that possibly do not fit in RAM. It also contains supporting code for evaluation and parameter tuning. Faiss is written in C++ with complete wrappers for Python/numpy. Some of the most useful algorithms are implemented on the GPU. It is developed by [Facebook AI Research](https://research.fb.com/category/facebook-ai-research-fair/).
## News
See [CHANGELOG.md](CHANGELOG.md) for detailed information about latest features.
## Introduction
Faiss contains several methods for similarity search. It assumes that the instances are represented as vectors and are identified by an integer, and that the vectors can be compared with L2 (Euclidean) distances or dot products. Vectors that are similar to a query vector are those that have the lowest L2 distance or the highest dot product with the query vector. It also supports cosine similarity, since this is a dot product on normalized vectors.
Most of the methods, like those based on binary vectors and compact quantization codes, solely use a compressed representation of the vectors and do not require to keep the original vectors. This generally comes at the cost of a less precise search but these methods can scale to billions of vectors in main memory on a single server.
The GPU implementation can accept input from either CPU or GPU memory. On a server with GPUs, the GPU indexes can be used as drop-in replacements for the CPU indexes (e.g., replace `IndexFlatL2` with `GpuIndexFlatL2`) and copies to/from GPU memory are handled automatically. Results will, however, be faster if both input and output remain resident on the GPU. Both single- and multi-GPU usage are supported.
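To make the search problem above concrete, here is a minimal NumPy sketch of what the simplest index, `IndexFlatL2`, computes: exact k-nearest-neighbor search under the L2 distance. With Faiss installed, the equivalent calls would be `faiss.IndexFlatL2(d)`, `index.add(xb)` and `index.search(xq, k)`; this sketch is only an illustration of the computation, not of Faiss's optimized implementation.

``` python
import numpy as np

def knn_l2(xb, xq, k):
    """Exact k-NN under squared L2 distance (what IndexFlatL2 computes)."""
    # Pairwise squared L2 distances between each query and each database vector.
    d2 = ((xq[:, None, :] - xb[None, :, :]) ** 2).sum(axis=-1)
    idx = np.argsort(d2, axis=1)[:, :k]             # k nearest ids per query
    return np.take_along_axis(d2, idx, axis=1), idx

rng = np.random.default_rng(42)
xb = rng.standard_normal((1000, 64)).astype("float32")  # database vectors
xq = rng.standard_normal((5, 64)).astype("float32")     # query vectors
D, I = knn_l2(xb, xq, k=4)   # distances and ids, each of shape (5, 4)

# For cosine similarity, L2-normalize both sets and rank by inner product instead.
```

The compressed-domain indexes described below approximate this same computation while using far less memory per vector.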
## Building
The library is mostly implemented in C++, with optional GPU support provided via CUDA, and an optional Python interface. The CPU version requires a BLAS library. It compiles with a Makefile and can be packaged in a docker image. See [INSTALL.md](INSTALL.md) for details.
## How Faiss works
Faiss is built around an index type that stores a set of vectors, and provides a function to search in them with L2 and/or dot product vector comparison. Some index types are simple baselines, such as exact search. Most of the available indexing structures correspond to various trade-offs with respect to
- search time
- search quality
- memory used per index vector
- training time
- need for external data for unsupervised training
The optional GPU implementation provides what is likely (as of March 2017) the fastest exact and approximate (compressed-domain) nearest neighbor search implementation for high-dimensional vectors, fastest Lloyd's k-means, and fastest small k-selection algorithm known. [The implementation is detailed here](https://arxiv.org/abs/1702.08734).
## Full documentation of Faiss
The following are entry points for documentation:
- the full documentation, including a [tutorial](https://github.com/facebookresearch/faiss/wiki/Getting-started), a [FAQ](https://github.com/facebookresearch/faiss/wiki/FAQ) and a [troubleshooting section](https://github.com/facebookresearch/faiss/wiki/Troubleshooting) can be found on the [wiki page](http://github.com/facebookresearch/faiss/wiki)
- the [doxygen documentation](https://facebookresearch.github.io/faiss) gives per-class information
- to reproduce results from our research papers, [Polysemous codes](https://arxiv.org/abs/1609.01882) and [Billion-scale similarity search with GPUs](https://arxiv.org/abs/1702.08734), refer to the [benchmarks README](benchs/README.md). For [
Link and code: Fast indexing with graphs and compact regression codes](https://arxiv.org/abs/1804.09996), see the [link_and_code README](benchs/link_and_code)
## Authors
The main authors of Faiss are:
- [Hervé Jégou](https://github.com/jegou) initiated the Faiss project and wrote its first implementation
- [Matthijs Douze](https://github.com/mdouze) implemented most of the CPU Faiss
- [Jeff Johnson](https://github.com/wickedfoo) implemented all of the GPU Faiss
- [Lucas Hosseini](https://github.com/beauby) implemented the binary indexes
## Reference
Reference to cite when you use Faiss in a research paper:
```
@article{JDH17,
title={Billion-scale similarity search with GPUs},
author={Johnson, Jeff and Douze, Matthijs and J{\'e}gou, Herv{\'e}},
journal={arXiv preprint arXiv:1702.08734},
year={2017}
}
```
## Join the Faiss community
For public discussion of Faiss or for questions, there is a Facebook group at https://www.facebook.com/groups/faissusers/
We monitor the [issues page](http://github.com/facebookresearch/faiss/issues) of the repository.
You can report bugs, ask questions, etc.
## License
Faiss is MIT-licensed.
# Benchmarking scripts
This directory contains benchmarking scripts that can reproduce the
numbers reported in the two papers
```
@inproceedings{DJP16,
Author = {Douze, Matthijs and J{\'e}gou, Herv{\'e} and Perronnin, Florent},
Booktitle = "ECCV",
Organization = {Springer},
Title = {Polysemous codes},
Year = {2016}
}
```
and
```
@inproceedings{JDJ17,
Author = {Jeff Johnson and Matthijs Douze and Herv{\'e} J{\'e}gou},
Journal = {arXiv:1702.08734},
Title = {Billion-scale similarity search with GPUs},
Year = {2017},
}
```
Note that the numbers (especially timings) change slightly due to changes in the implementation, different machines, etc.
The scripts are self-contained. They depend only on Faiss and external training data that should be stored in sub-directories.
## SIFT1M experiments
The script [`bench_polysemous_sift1m.py`](bench_polysemous_sift1m.py) reproduces the numbers in
Figure 3 from the "Polysemous" paper.
### Getting SIFT1M
To run it, please download the ANN_SIFT1M dataset from
http://corpus-texmex.irisa.fr/
and unzip it to the subdirectory sift1M.
### Result
The output looks like:
```
PQ training on 100000 points, remains 0 points: training polysemous on centroids
add vectors to index
PQ baseline 7.517 ms per query, R@1 0.4474
Polysemous 64 9.875 ms per query, R@1 0.4474
Polysemous 62 8.358 ms per query, R@1 0.4474
Polysemous 58 5.531 ms per query, R@1 0.4474
Polysemous 54 3.420 ms per query, R@1 0.4478
Polysemous 50 2.182 ms per query, R@1 0.4475
Polysemous 46 1.621 ms per query, R@1 0.4408
Polysemous 42 1.448 ms per query, R@1 0.4174
Polysemous 38 1.331 ms per query, R@1 0.3563
Polysemous 34 1.334 ms per query, R@1 0.2661
Polysemous 30 1.272 ms per query, R@1 0.1794
```
## Experiments on 1B elements dataset
The script [`bench_polysemous_1bn.py`](bench_polysemous_1bn.py) reproduces a few experiments on
two datasets of size 1B from the "Polysemous codes" paper.
### Getting BIGANN
Download the four files of ANN_SIFT1B from
http://corpus-texmex.irisa.fr/ to subdirectory bigann/
### Getting Deep1B
The ground-truth and queries are available here
https://yadi.sk/d/11eDCm7Dsn9GA
For the learning and database vectors, use the script
https://github.com/arbabenko/GNOIMI/blob/master/downloadDeep1B.py
to download the data to subdirectory deep1b/, then concatenate the
database files to base.fvecs and the training files to learn.fvecs
### Running the experiments
These experiments are quite long. To support resuming, the script
stores the result of training to a temporary directory, `/tmp/bench_polysemous`.
The script `bench_polysemous_1bn.py` takes at least two arguments:
- the dataset name: SIFT1000M (aka SIFT1B, aka BIGANN) or Deep1B. SIFT1M, SIFT2M,... are also supported, to make subsets for small experiments (note that SIFT1M as a subset of SIFT1B is not the same as the SIFT1M above)
- the type of index to build, which should be a valid [index_factory key](https://github.com/facebookresearch/faiss/wiki/High-level-interface-and-auto-tuning#index-factory) (see below for examples)
- the remaining arguments are parsed as search-time parameters.
### Experiments of Table 2
The `IMI*+PolyD+ADC` results in Table 2 can be reproduced with (for 16 bytes):
```
python bench_polysemous_1bn.py SIFT1000M IMI2x12,PQ16 nprobe=16,max_codes={10000,30000},ht={44..54}
```
Training takes about 2 minutes and adding vectors to the dataset
takes 3.1 h. These operations are multithreaded. Note that in the command
above, we use bash's [brace expansion](https://www.gnu.org/software/bash/manual/html_node/Brace-Expansion.html) to set a grid of parameters.
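For readers not using bash, the same parameter grid can be generated programmatically; here is a hypothetical Python equivalent of the brace expansion above (the variable names are illustrative, not part of the benchmark script):

``` python
from itertools import product

# Enumerate the grid of search-time parameter strings that the benchmark
# sweeps over, mirroring nprobe=16,max_codes={10000,30000},ht={44..54}.
max_codes_values = [10000, 30000]   # max_codes={10000,30000}
ht_values = range(44, 55)           # ht={44..54}

param_strings = [
    f"nprobe=16,max_codes={mc},ht={ht}"
    for mc, ht in product(max_codes_values, ht_values)
]
# 2 * 11 = 22 parameter combinations, one per row of the result table.
```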
The search is *not* multithreaded, and the output looks like:
```
R@1 R@10 R@100 time %pass
nprobe=16,max_codes=10000,ht=44 0.1779 0.2994 0.3139 0.194 12.45
nprobe=16,max_codes=10000,ht=45 0.1859 0.3183 0.3339 0.197 14.24
nprobe=16,max_codes=10000,ht=46 0.1930 0.3366 0.3543 0.202 16.22
nprobe=16,max_codes=10000,ht=47 0.1993 0.3550 0.3745 0.209 18.39
nprobe=16,max_codes=10000,ht=48 0.2033 0.3694 0.3917 0.640 20.77
nprobe=16,max_codes=10000,ht=49 0.2070 0.3839 0.4077 0.229 23.36
nprobe=16,max_codes=10000,ht=50 0.2101 0.3949 0.4205 0.232 26.17
nprobe=16,max_codes=10000,ht=51 0.2120 0.4042 0.4310 0.239 29.21
nprobe=16,max_codes=10000,ht=52 0.2134 0.4113 0.4402 0.245 32.47
nprobe=16,max_codes=10000,ht=53 0.2157 0.4184 0.4482 0.250 35.96
nprobe=16,max_codes=10000,ht=54 0.2170 0.4240 0.4546 0.256 39.66
nprobe=16,max_codes=30000,ht=44 0.1882 0.3327 0.3555 0.226 11.29
nprobe=16,max_codes=30000,ht=45 0.1964 0.3525 0.3771 0.231 13.05
nprobe=16,max_codes=30000,ht=46 0.2039 0.3713 0.3987 0.236 15.01
nprobe=16,max_codes=30000,ht=47 0.2103 0.3907 0.4202 0.245 17.19
nprobe=16,max_codes=30000,ht=48 0.2145 0.4055 0.4384 0.251 19.60
nprobe=16,max_codes=30000,ht=49 0.2179 0.4198 0.4550 0.257 22.25
nprobe=16,max_codes=30000,ht=50 0.2208 0.4305 0.4681 0.268 25.15
nprobe=16,max_codes=30000,ht=51 0.2227 0.4402 0.4791 0.275 28.30
nprobe=16,max_codes=30000,ht=52 0.2241 0.4473 0.4884 0.284 31.70
nprobe=16,max_codes=30000,ht=53 0.2265 0.4544 0.4965 0.294 35.34
nprobe=16,max_codes=30000,ht=54 0.2278 0.4601 0.5031 0.303 39.20
```
The result reported in Table 2 is the one for which the %pass (percentage of code comparisons that pass the Hamming check) is around 20%, which occurs for Hamming threshold `ht=48`.
The 8-byte results can be reproduced with the factory key `IMI2x12,PQ8`.
### Experiments of the appendix
The experiments in the appendix appear only in the arXiv version of the paper (Table 3).
```
python bench_polysemous_1bn.py SIFT1000M OPQ8_64,IMI2x13,PQ8 nprobe={1,2,4,8,16,32,64,128},ht={20,24,26,28,30}
R@1 R@10 R@100 time %pass
nprobe=1,ht=20 0.0351 0.0616 0.0751 0.158 19.01
...
nprobe=32,ht=28 0.1256 0.3563 0.5026 0.561 52.61
...
```
Here again the runs are not exactly reproducible, but the original result was obtained with nprobe=32,ht=28.
For Deep1B, we used a simple version of [auto-tuning](https://github.com/facebookresearch/faiss/wiki/High-level-interface-and-auto-tuning#auto-tuning-the-runtime-parameters) to sweep through the set of operating points:
```
python bench_polysemous_1bn.py Deep1B OPQ20_80,IMI2x14,PQ20 autotune
...
Done in 4067.555 s, available OPs:
Parameters 1-R@1 time
0.0000 0.000
nprobe=1,ht=22,max_codes=256 0.0215 3.115
nprobe=1,ht=30,max_codes=256 0.0381 3.120
...
nprobe=512,ht=68,max_codes=524288 0.4478 36.903
nprobe=1024,ht=80,max_codes=131072 0.4557 46.363
nprobe=1024,ht=78,max_codes=262144 0.4616 61.939
...
```
The original results were obtained with `nprobe=1024,ht=66,max_codes=262144`.
## GPU experiments
The benchmarks below run on 1 or 4 Titan X GPUs and reproduce the results of the "GPU paper". They are also a good starting point for how to use GPU Faiss.
### Search on SIFT1M
See above on how to get SIFT1M into subdirectory sift1M/. The script [`bench_gpu_sift1m.py`](bench_gpu_sift1m.py) reproduces the "exact k-NN time" plot in the ArXiv paper, and the SIFT1M numbers.
The output is:
```
============ Exact search
add vectors to index
warmup
benchmark
k=1 0.715 s, R@1 0.9914
k=2 0.729 s, R@1 0.9935
k=4 0.731 s, R@1 0.9935
k=8 0.732 s, R@1 0.9935
k=16 0.742 s, R@1 0.9935
k=32 0.737 s, R@1 0.9935
k=64 0.753 s, R@1 0.9935
k=128 0.761 s, R@1 0.9935
k=256 0.799 s, R@1 0.9935
k=512 0.975 s, R@1 0.9935
k=1024 1.424 s, R@1 0.9935
============ Approximate search
train
WARNING clustering 100000 points to 4096 centroids: please provide at least 159744 training points
add vectors to index
WARN: increase temp memory to avoid cudaMalloc, or decrease query/add size (alloc 256000000 B, highwater 256000000 B)
warmup
benchmark
nprobe= 1 0.043 s recalls= 0.3909 0.4312 0.4312
nprobe= 2 0.040 s recalls= 0.5041 0.5636 0.5636
nprobe= 4 0.048 s recalls= 0.6048 0.6897 0.6897
nprobe= 8 0.064 s recalls= 0.6879 0.8028 0.8028
nprobe= 16 0.088 s recalls= 0.7534 0.8940 0.8940
nprobe= 32 0.134 s recalls= 0.7957 0.9549 0.9550
nprobe= 64 0.224 s recalls= 0.8125 0.9833 0.9834
nprobe= 128 0.395 s recalls= 0.8205 0.9953 0.9954
nprobe= 256 0.717 s recalls= 0.8227 0.9993 0.9994
nprobe= 512 1.348 s recalls= 0.8228 0.9999 1.0000
```
The run produces two warnings:
- the clustering complains that it does not have enough training data; there is not much we can do about this.
- the add() function complains about an inefficient memory allocation; this is only a concern when it happens often, and we are not benchmarking the add time anyway.
To index small datasets, it is more efficient to use a `GpuIndexIVFFlat`, which just stores the full vectors in the inverted lists. We did not mention this in the paper because it is not as scalable. To experiment with this setting, change the `index_factory` string from "IVF4096,PQ64" to "IVF16384,Flat". This gives:
```
nprobe= 1 0.025 s recalls= 0.4084 0.4105 0.4105
nprobe= 2 0.033 s recalls= 0.5235 0.5264 0.5264
nprobe= 4 0.033 s recalls= 0.6332 0.6367 0.6367
nprobe= 8 0.040 s recalls= 0.7358 0.7403 0.7403
nprobe= 16 0.049 s recalls= 0.8273 0.8324 0.8324
nprobe= 32 0.068 s recalls= 0.8957 0.9024 0.9024
nprobe= 64 0.104 s recalls= 0.9477 0.9549 0.9549
nprobe= 128 0.174 s recalls= 0.9760 0.9837 0.9837
nprobe= 256 0.299 s recalls= 0.9866 0.9944 0.9944
nprobe= 512 0.527 s recalls= 0.9907 0.9987 0.9987
```
### Clustering on MNIST8m
To get the "infinite MNIST dataset", follow the instructions on [Léon Bottou's website](http://leon.bottou.org/projects/infimnist). The script assumes the file `mnist8m-patterns-idx3-ubyte` is in subdirectory `mnist8m`.
The script [`kmeans_mnist.py`](kmeans_mnist.py) produces the following output:
```
python kmeans_mnist.py 1 256
...
Clustering 8100000 points in 784D to 256 clusters, redo 1 times, 20 iterations
Preprocessing in 7.94526 s
Iteration 19 (131.697 s, search 114.78 s): objective=1.44881e+13 imbalance=1.05963 nsplit=0
final objective: 1.449e+13
total runtime: 140.615 s
```
### Search on SIFT1B
The script [`bench_gpu_1bn.py`](bench_gpu_1bn.py) runs multi-gpu searches on the two 1-billion vector datasets we considered. It is more complex than the previous scripts, because it supports many search options and decomposes the dataset build process in Python to exploit the best possible CPU/GPU parallelism and GPU distribution.
Even on multiple GPUs, building the 1B datasets can last several hours. It is often a good idea to validate that everything is working fine on smaller datasets like SIFT1M, SIFT2M, etc.
The search results on SIFT1B in the "GPU paper" can be obtained with
<!-- see P57124181 -->
```
python bench_gpu_1bn.py SIFT1000M OPQ8_32,IVF262144,PQ8 -nnn 10 -ngpu 1 -tempmem $[1536*1024*1024]
...
0/10000 (0.024 s) probe=1 : 0.161 s 1-R@1: 0.0752 1-R@10: 0.1924
0/10000 (0.005 s) probe=2 : 0.150 s 1-R@1: 0.0964 1-R@10: 0.2693
0/10000 (0.005 s) probe=4 : 0.153 s 1-R@1: 0.1102 1-R@10: 0.3328
0/10000 (0.005 s) probe=8 : 0.170 s 1-R@1: 0.1220 1-R@10: 0.3827
0/10000 (0.005 s) probe=16 : 0.196 s 1-R@1: 0.1290 1-R@10: 0.4151
0/10000 (0.006 s) probe=32 : 0.244 s 1-R@1: 0.1314 1-R@10: 0.4345
0/10000 (0.006 s) probe=64 : 0.353 s 1-R@1: 0.1332 1-R@10: 0.4461
0/10000 (0.005 s) probe=128: 0.587 s 1-R@1: 0.1341 1-R@10: 0.4502
0/10000 (0.006 s) probe=256: 1.160 s 1-R@1: 0.1342 1-R@10: 0.4511
```
We use the `-tempmem` option to reduce the temporary memory allocation to 1.5 GiB, otherwise the dataset does not fit in GPU memory.
### Search on Deep1B
The same script generates the GPU search results on Deep1B.
```
python bench_gpu_1bn.py Deep1B OPQ20_80,IVF262144,PQ20 -nnn 10 -R 2 -ngpu 4 -altadd -noptables -tempmem $[1024*1024*1024]
...
0/10000 (0.115 s) probe=1 : 0.239 s 1-R@1: 0.2387 1-R@10: 0.3420
0/10000 (0.006 s) probe=2 : 0.103 s 1-R@1: 0.3110 1-R@10: 0.4623
0/10000 (0.005 s) probe=4 : 0.105 s 1-R@1: 0.3772 1-R@10: 0.5862
0/10000 (0.005 s) probe=8 : 0.116 s 1-R@1: 0.4235 1-R@10: 0.6889
0/10000 (0.005 s) probe=16 : 0.133 s 1-R@1: 0.4517 1-R@10: 0.7693
0/10000 (0.005 s) probe=32 : 0.168 s 1-R@1: 0.4713 1-R@10: 0.8281
0/10000 (0.005 s) probe=64 : 0.238 s 1-R@1: 0.4841 1-R@10: 0.8649
0/10000 (0.007 s) probe=128: 0.384 s 1-R@1: 0.4900 1-R@10: 0.8816
0/10000 (0.005 s) probe=256: 0.736 s 1-R@1: 0.4933 1-R@10: 0.8912
```
Here we are a bit tight on memory, so we disable precomputed tables (`-noptables`) and restrict the amount of temporary memory. The `-altadd` option avoids GPU memory overflows during add.
### k-NN graph on Deep1B
The same script generates the k-NN graph on Deep1B. Note that the inverted file from above is not re-used because the training sets are different. For the k-NN graph, the script first does a pass over the whole dataset to compute the exact ground-truth k-NN for a subset of 10k nodes, used for evaluation.
```
python bench_gpu_1bn.py Deep1B OPQ20_80,IVF262144,PQ20 -nnn 10 -altadd -knngraph -R 2 -noptables -tempmem $[1<<30] -ngpu 4
...
CPU index contains 1000000000 vectors, move to GPU
Copy CPU index to 2 sharded GPU indexes
dispatch to GPUs 0:2
IndexShards shard 0 indices 0:500000000
IndexIVFPQ size 500000000 -> GpuIndexIVFPQ indicesOptions=0 usePrecomputed=0 useFloat16=0 reserveVecs=0
IndexShards shard 1 indices 500000000:1000000000
IndexIVFPQ size 500000000 -> GpuIndexIVFPQ indicesOptions=0 usePrecomputed=0 useFloat16=0 reserveVecs=0
dispatch to GPUs 2:4
IndexShards shard 0 indices 0:500000000
IndexIVFPQ size 500000000 -> GpuIndexIVFPQ indicesOptions=0 usePrecomputed=0 useFloat16=0 reserveVecs=0
IndexShards shard 1 indices 500000000:1000000000
IndexIVFPQ size 500000000 -> GpuIndexIVFPQ indicesOptions=0 usePrecomputed=0 useFloat16=0 reserveVecs=0
move to GPU done in 151.535 s
search...
999997440/1000000000 (8389.961 s, 0.3379) probe=1 : 8389.990 s rank-10 intersection results: 0.3379
999997440/1000000000 (9205.934 s, 0.4079) probe=2 : 9205.966 s rank-10 intersection results: 0.4079
999997440/1000000000 (9741.095 s, 0.4722) probe=4 : 9741.128 s rank-10 intersection results: 0.4722
999997440/1000000000 (10830.420 s, 0.5256) probe=8 : 10830.455 s rank-10 intersection results: 0.5256
999997440/1000000000 (12531.716 s, 0.5603) probe=16 : 12531.758 s rank-10 intersection results: 0.5603
999997440/1000000000 (15922.519 s, 0.5825) probe=32 : 15922.571 s rank-10 intersection results: 0.5825
999997440/1000000000 (22774.153 s, 0.5950) probe=64 : 22774.220 s rank-10 intersection results: 0.5950
999997440/1000000000 (36717.207 s, 0.6015) probe=128: 36717.309 s rank-10 intersection results: 0.6015
999997440/1000000000 (70616.392 s, 0.6047) probe=256: 70616.581 s rank-10 intersection results: 0.6047
```
/**
 * Copyright (c) Facebook, Inc. and its affiliates.
 *
 * This source code is licensed under the MIT license found in the
 * LICENSE file in the root directory of this source tree.
 */

#include <omp.h>

#include <cstdint>
#include <cstdio>
#include <memory>
#include <vector>

#include <faiss/impl/ScalarQuantizer.h>
#include <faiss/utils/distances.h>
#include <faiss/utils/random.h>
#include <faiss/utils/utils.h>

using namespace faiss;

int main() {
    int d = 128;
    int n = 2000;

    std::vector<float> x(d * n);
    float_rand(x.data(), d * n, 12345);

    // train a 6-bit scalar quantizer; the test checks that
    // encode -> decode -> encode is idempotent
    ScalarQuantizer sq(d, ScalarQuantizer::QT_6bit);
    omp_set_num_threads(1);
    sq.train(n, x.data());

    size_t code_size = sq.code_size;
    printf("code size: %zd\n", sq.code_size);

    // encode
    std::vector<uint8_t> codes(code_size * n);
    sq.compute_codes(x.data(), codes.data(), n);

    // decode
    std::vector<float> x2(d * n);
    sq.decode(codes.data(), x2.data(), n);

    printf("sqL2 recons error: %g\n",
           fvec_L2sqr(x.data(), x2.data(), n * d) / n);

    // re-encode the decoded vectors: the codes should not change
    std::vector<uint8_t> codes2(code_size * n);
    sq.compute_codes(x2.data(), codes2.data(), n);

    size_t ndiff = 0;
    for (size_t i = 0; i < codes.size(); i++) {
        if (codes[i] != codes2[i])
            ndiff++;
    }

    printf("ndiff for idempotence: %zd / %zd\n", ndiff, codes.size());

    // time the asymmetric distance computations over the codes
    std::unique_ptr<ScalarQuantizer::SQDistanceComputer> dc(
            sq.get_distance_computer());
    dc->codes = codes.data();
    dc->code_size = sq.code_size;
    printf("code size: %zd\n", dc->code_size);

    double sum_dis = 0;
    double t0 = getmillisecs();
    for (int i = 0; i < n; i++) {
        dc->set_query(&x[i * d]);
        for (int j = 0; j < n; j++) {
            sum_dis += (*dc)(j);
        }
    }
    printf("distances computed in %.3f ms, checksum=%g\n",
           getmillisecs() - t0,
           sum_dis);

    return 0;
}
# Benchmark of IVF variants
This is a benchmark of IVF index variants, looking at compression vs. speed vs. accuracy.
The results are in [this wiki chapter](https://github.com/facebookresearch/faiss/wiki/Indexing-1G-vectors)
The code is organized as:
- `datasets.py`: code to access the datafiles, compute the ground-truth and report accuracies
- `bench_all_ivf.py`: evaluate one type of inverted file
- `run_on_cluster_generic.bash`: call `bench_all_ivf.py` for all tested types of indices.
Since the number of experiments is quite large, the script is structured so that the benchmark can be run on a cluster.
- `parse_bench_all_ivf.py`: make nice tradeoff plots from all the results.
The code depends on Faiss and can use 1 to 8 GPUs to do the k-means clustering for large vocabularies.
It was run in October 2018 for the results in the wiki.
# Copyright (c) Facebook, Inc. and its affiliates.
#
# This source code is licensed under the MIT license found in the
# LICENSE file in the root directory of this source tree.
import os
import sys
import time
import pdb
import numpy as np
import faiss
import argparse
import datasets
from datasets import sanitize
######################################################
# Command-line parsing
######################################################
parser = argparse.ArgumentParser()


def aa(*args, **kwargs):
    group.add_argument(*args, **kwargs)


group = parser.add_argument_group('dataset options')

aa('--db', default='deep1M', help='dataset')
aa('--compute_gt', default=False, action='store_true',
   help='compute and store the groundtruth')
aa('--force_IP', default=False, action="store_true",
   help='force IP search instead of L2')

group = parser.add_argument_group('index construction')

aa('--indexkey', default='HNSW32', help='index_factory type')
aa('--maxtrain', default=256 * 256, type=int,
   help='maximum number of training points (0 to set automatically)')
aa('--indexfile', default='', help='file to read or write index from')
aa('--add_bs', default=-1, type=int,
   help='add elements index by batches of this size')

group = parser.add_argument_group('IVF options')

aa('--by_residual', default=-1, type=int,
   help="set if index should use residuals (default=unchanged)")
aa('--no_precomputed_tables', action='store_true', default=False,
   help='disable precomputed tables (uses less memory)')
aa('--get_centroids_from', default='',
   help='get the centroids from this index (to speed up training)')
aa('--clustering_niter', default=-1, type=int,
   help='number of clustering iterations (-1 = leave default)')
aa('--train_on_gpu', default=False, action='store_true',
   help='do training on GPU')

group = parser.add_argument_group('index-specific options')

aa('--M0', default=-1, type=int, help='size of base level for HNSW')
aa('--RQ_train_default', default=False, action="store_true",
   help='disable progressive dim training for RQ')
aa('--RQ_beam_size', default=-1, type=int,
   help='set beam size at add time')
aa('--LSQ_encode_ils_iters', default=-1, type=int,
   help='ILS iterations for LSQ')
aa('--RQ_use_beam_LUT', default=-1, type=int,
   help='use beam LUT at add time')

group = parser.add_argument_group('searching')

aa('--k', default=100, type=int, help='nb of nearest neighbors')
aa('--inter', default=False, action='store_true',
   help='use intersection measure instead of 1-recall as metric')
aa('--searchthreads', default=-1, type=int,
   help='nb of threads to use at search time')
aa('--searchparams', nargs='+', default=['autotune'],
   help="search parameters to use (can be autotune or a list of params)")
aa('--n_autotune', default=500, type=int,
   help="max nb of autotune experiments")
aa('--autotune_max', default=[], nargs='*',
   help='set max value for autotune variables format "var:val" (exclusive)')
aa('--autotune_range', default=[], nargs='*',
   help='set complete autotune range, format "var:val1,val2,..."')
aa('--min_test_duration', default=3.0, type=float,
   help='run test at least for so long to avoid jitter')

args = parser.parse_args()

print("args:", args)

os.system('echo -n "nb processors "; '
          'cat /proc/cpuinfo | grep ^processor | wc -l; '
          'cat /proc/cpuinfo | grep ^"model name" | tail -1')
######################################################
# Load dataset
######################################################
ds = datasets.load_dataset(
    dataset=args.db, compute_gt=args.compute_gt)
if args.force_IP:
    ds.metric = "IP"

print(ds)

nq, d = ds.nq, ds.d
nb, d = ds.nb, ds.d
######################################################
# Make index
######################################################
def unwind_index_ivf(index):
    if isinstance(index, faiss.IndexPreTransform):
        assert index.chain.size() == 1
        vt = index.chain.at(0)
        index_ivf, vt2 = unwind_index_ivf(faiss.downcast_index(index.index))
        assert vt2 is None
        return index_ivf, vt
    if hasattr(faiss, "IndexRefine") and isinstance(index, faiss.IndexRefine):
        return unwind_index_ivf(faiss.downcast_index(index.base_index))
    if isinstance(index, faiss.IndexIVF):
        return index, None
    else:
        return None, None
def apply_AQ_options(index, args):
    # if not(
    #    isinstance(index, faiss.IndexAdditiveQuantize) or
    #    isinstance(index, faiss.IndexIVFAdditiveQuantizer)):
    #    return
    if args.RQ_train_default:
        print("set default training for RQ")
        index.rq.train_type   # make sure the field exists
        index.rq.train_type = faiss.ResidualQuantizer.Train_default
    if args.RQ_beam_size != -1:
        print("set RQ beam size to", args.RQ_beam_size)
        index.rq.max_beam_size   # make sure the field exists
        index.rq.max_beam_size = args.RQ_beam_size
    if args.LSQ_encode_ils_iters != -1:
        print("set LSQ ils iterations to", args.LSQ_encode_ils_iters)
        index.lsq.encode_ils_iters   # make sure the field exists
        index.lsq.encode_ils_iters = args.LSQ_encode_ils_iters
    if args.RQ_use_beam_LUT != -1:
        print("set RQ beam LUT to", args.RQ_use_beam_LUT)
        index.rq.use_beam_LUT   # make sure the field exists
        index.rq.use_beam_LUT = args.RQ_use_beam_LUT
if args.indexfile and os.path.exists(args.indexfile):
    print("reading", args.indexfile)
    index = faiss.read_index(args.indexfile)

    index_ivf, vec_transform = unwind_index_ivf(index)
    if vec_transform is None:
        vec_transform = lambda x: x
else:
    print("build index, key=", args.indexkey)

    index = faiss.index_factory(
        d, args.indexkey, faiss.METRIC_L2 if ds.metric == "L2" else
        faiss.METRIC_INNER_PRODUCT
    )

    index_ivf, vec_transform = unwind_index_ivf(index)
    if vec_transform is None:
        vec_transform = lambda x: x
    else:
        vec_transform = faiss.downcast_VectorTransform(vec_transform)

    if args.by_residual != -1:
        by_residual = args.by_residual == 1
        print("setting by_residual = ", by_residual)
        index_ivf.by_residual   # check if field exists
        index_ivf.by_residual = by_residual

    if index_ivf:
        print("Update add-time parameters")
        # adjust default parameters used at add time for quantizers
        # because otherwise the assignment is inaccurate
        quantizer = faiss.downcast_index(index_ivf.quantizer)
        if isinstance(quantizer, faiss.IndexRefine):
            print(" update quantizer k_factor=", quantizer.k_factor, end=" -> ")
            quantizer.k_factor = 32 if index_ivf.nlist < 1e6 else 64
            print(quantizer.k_factor)
            base_index = faiss.downcast_index(quantizer.base_index)
            if isinstance(base_index, faiss.IndexIVF):
                print(" update quantizer nprobe=", base_index.nprobe, end=" -> ")
                base_index.nprobe = (
                    16 if base_index.nlist < 1e5 else
                    32 if base_index.nlist < 4e6 else
                    64)
                print(base_index.nprobe)
        elif isinstance(quantizer, faiss.IndexHNSW):
            print(" update quantizer efSearch=", quantizer.hnsw.efSearch, end=" -> ")
            quantizer.hnsw.efSearch = 40 if index_ivf.nlist < 4e6 else 64
            print(quantizer.hnsw.efSearch)

    apply_AQ_options(index_ivf or index, args)

    if index_ivf:
        index_ivf.verbose = True
        index_ivf.quantizer.verbose = True
        index_ivf.cp.verbose = True
    else:
        index.verbose = True

    maxtrain = args.maxtrain
    if maxtrain == 0:
        if 'IMI' in args.indexkey:
            maxtrain = int(256 * 2 ** (np.log2(index_ivf.nlist) / 2))
        elif index_ivf:
            maxtrain = 50 * index_ivf.nlist
        else:
            # just guess...
            maxtrain = 256 * 100
        maxtrain = max(maxtrain, 256 * 100)
        print("setting maxtrain to %d" % maxtrain)

    try:
        xt2 = ds.get_train(maxtrain=maxtrain)
    except NotImplementedError:
        print("No training set: training on database")
        xt2 = ds.get_database()[:maxtrain]

    print("train, size", xt2.shape)
    assert np.all(np.isfinite(xt2))

    if (isinstance(vec_transform, faiss.OPQMatrix) and
            isinstance(index_ivf, faiss.IndexIVFPQFastScan)):
        print(" Forcing OPQ training PQ to PQ4")
        ref_pq = index_ivf.pq
        training_pq = faiss.ProductQuantizer(
            ref_pq.d, ref_pq.M, ref_pq.nbits
        )
        vec_transform.pq   # make sure the field exists
        vec_transform.pq = training_pq

    if args.get_centroids_from == '':
        if args.clustering_niter >= 0:
            print(("setting nb of clustering iterations to %d" %
                   args.clustering_niter))
            index_ivf.cp.niter = args.clustering_niter

        if args.train_on_gpu:
            print("add a training index on GPU")
            train_index = faiss.index_cpu_to_all_gpus(
                faiss.IndexFlatL2(index_ivf.d))
            index_ivf.clustering_index = train_index
    else:
        print("Getting centroids from", args.get_centroids_from)
        src_index = faiss.read_index(args.get_centroids_from)
        src_quant = faiss.downcast_index(src_index.quantizer)
        centroids = faiss.vector_to_array(src_quant.xb)
        centroids = centroids.reshape(-1, d)
        print(" centroid table shape", centroids.shape)

        if isinstance(vec_transform, faiss.VectorTransform):
            print(" training vector transform")
            vec_transform.train(xt2)
            print(" transform centroids")
            centroids = vec_transform.apply_py(centroids)

        if not index_ivf.quantizer.is_trained:
            print(" training quantizer")
            index_ivf.quantizer.train(centroids)

        print(" add centroids to quantizer")
        index_ivf.quantizer.add(centroids)
        del src_index

    t0 = time.time()
    index.train(xt2)
    print(" train in %.3f s" % (time.time() - t0))

    print("adding")
    t0 = time.time()
    if args.add_bs == -1:
        index.add(sanitize(ds.get_database()))
    else:
        i0 = 0
        for xblock in ds.database_iterator(bs=args.add_bs):
            i1 = i0 + len(xblock)
            print(" adding %d:%d / %d [%.3f s, RSS %d kiB] " % (
                i0, i1, ds.nb, time.time() - t0,
                faiss.get_mem_usage_kb()))
            index.add(xblock)
            i0 = i1

    print(" add in %.3f s" % (time.time() - t0))

    if args.indexfile:
        print("storing", args.indexfile)
        faiss.write_index(index, args.indexfile)
if args.no_precomputed_tables:
    if isinstance(index_ivf, faiss.IndexIVFPQ):
        print("disabling precomputed table")
        index_ivf.use_precomputed_table = -1
        index_ivf.precomputed_table.clear()

if args.indexfile:
    print("index size on disk: ", os.stat(args.indexfile).st_size)

if hasattr(index, "code_size"):
    print("vector code_size", index.code_size)
if hasattr(index_ivf, "code_size"):
    print("vector code_size (IVF)", index_ivf.code_size)

print("current RSS:", faiss.get_mem_usage_kb() * 1024)

precomputed_table_size = 0
if hasattr(index_ivf, 'precomputed_table'):
    precomputed_table_size = index_ivf.precomputed_table.size() * 4

print("precomputed tables size:", precomputed_table_size)
#############################################################
# Index is ready
#############################################################
xq = sanitize(ds.get_queries())
gt = ds.get_groundtruth(k=args.k)
assert gt.shape[1] == args.k, pdb.set_trace()

if args.searchthreads != -1:
    print("Setting nb of threads to", args.searchthreads)
    faiss.omp_set_num_threads(args.searchthreads)
else:
    print("nb search threads: ", faiss.omp_get_max_threads())

ps = faiss.ParameterSpace()
ps.initialize(index)

parametersets = args.searchparams

if args.inter:
    header = (
        '%-40s inter@%3d time(ms/q) nb distances #runs' %
        ("parameters", args.k)
    )
else:
    header = (
        '%-40s R@1 R@10 R@100 time(ms/q) nb distances #runs' %
        "parameters"
    )


def compute_inter(a, b):
    nq, rank = a.shape
    ninter = sum(
        np.intersect1d(a[i, :rank], b[i, :rank]).size
        for i in range(nq)
    )
    return ninter / a.size
def eval_setting(index, xq, gt, k, inter, min_time):
    nq = xq.shape[0]
    ivf_stats = faiss.cvar.indexIVF_stats
    ivf_stats.reset()
    nrun = 0
    t0 = time.time()
    while True:
        D, I = index.search(xq, k)
        nrun += 1
        t1 = time.time()
        if t1 - t0 > min_time:
            break
    ms_per_query = ((t1 - t0) * 1000.0 / nq / nrun)
    if inter:
        rank = k
        inter_measure = compute_inter(gt[:, :rank], I[:, :rank])
        print("%.4f" % inter_measure, end=' ')
    else:
        for rank in 1, 10, 100:
            n_ok = (I[:, :rank] == gt[:, :1]).sum()
            print("%.4f" % (n_ok / float(nq)), end=' ')
    print(" %9.5f " % ms_per_query, end=' ')
    print("%12d " % (ivf_stats.ndis / nrun), end=' ')
    print(nrun)
if parametersets == ['autotune']:

    ps.n_experiments = args.n_autotune
    ps.min_test_duration = args.min_test_duration

    for kv in args.autotune_max:
        k, vmax = kv.split(':')
        vmax = float(vmax)
        print("limiting %s to %g" % (k, vmax))
        pr = ps.add_range(k)
        values = faiss.vector_to_array(pr.values)
        values = np.array([v for v in values if v < vmax])
        faiss.copy_array_to_vector(values, pr.values)

    for kv in args.autotune_range:
        k, vals = kv.split(':')
        vals = np.fromstring(vals, sep=',')
        print("setting %s to %s" % (k, vals))
        pr = ps.add_range(k)
        faiss.copy_array_to_vector(vals, pr.values)

    # setup the Criterion object
    if args.inter:
        print("Optimize for intersection @ ", args.k)
        crit = faiss.IntersectionCriterion(nq, args.k)
    else:
        print("Optimize for 1-recall @ 1")
        crit = faiss.OneRecallAtRCriterion(nq, 1)

    # by default, the criterion will request only 1 NN
    crit.nnn = args.k
    crit.set_groundtruth(None, gt.astype('int64'))

    # then we let Faiss find the optimal parameters by itself
    print("exploring operating points, %d threads" % faiss.omp_get_max_threads())
    ps.display()

    t0 = time.time()
    op = ps.explore(index, xq, crit)
    print("Done in %.3f s, available OPs:" % (time.time() - t0))

    op.display()

    print("Re-running evaluation on selected OPs")
    print(header)
    opv = op.optimal_pts
    maxw = max(max(len(opv.at(i).key) for i in range(opv.size())), 40)
    for i in range(opv.size()):
        opt = opv.at(i)
        ps.set_index_parameters(index, opt.key)
        print(opt.key.ljust(maxw), end=' ')
        sys.stdout.flush()
        eval_setting(index, xq, gt, args.k, args.inter, args.min_test_duration)
else:
    print(header)
    for param in parametersets:
        print("%-40s " % param, end=' ')
        sys.stdout.flush()
        ps.set_index_parameters(index, param)
        eval_setting(index, xq, gt, args.k, args.inter, args.min_test_duration)
# Copyright (c) Facebook, Inc. and its affiliates.
#
# This source code is licensed under the MIT license found in the
# LICENSE file in the root directory of this source tree.
import os
import numpy as np
import faiss
import argparse
import datasets
from datasets import sanitize
######################################################
# Command-line parsing
######################################################
parser = argparse.ArgumentParser()


def aa(*args, **kwargs):
    group.add_argument(*args, **kwargs)


group = parser.add_argument_group('dataset options')

aa('--db', default='deep1M', help='dataset')
aa('--nt', default=65536, type=int)
aa('--nb', default=100000, type=int)
aa('--nt_sample', default=0, type=int)

group = parser.add_argument_group('kmeans options')

aa('--k', default=256, type=int)
aa('--seed', default=12345, type=int)
aa('--pcadim', default=-1, type=int, help='PCA to this dimension')
aa('--niter', default=25, type=int)
aa('--eval_freq', default=100, type=int)

args = parser.parse_args()

print("args:", args)

os.system('echo -n "nb processors "; '
          'cat /proc/cpuinfo | grep ^processor | wc -l; '
          'cat /proc/cpuinfo | grep ^"model name" | tail -1')

ngpu = faiss.get_num_gpus()
print("nb GPUs:", ngpu)
######################################################
# Load dataset
######################################################
xt, xb, xq, gt = datasets.load_data(dataset=args.db)

if args.nt_sample == 0:
    xt_pca = xt[args.nt:args.nt + 10000]
    xt = xt[:args.nt]
else:
    xt_pca = xt[args.nt_sample:args.nt_sample + 10000]
    rs = np.random.RandomState(args.seed)
    idx = rs.choice(args.nt_sample, size=args.nt, replace=False)
    xt = xt[idx]

xb = xb[:args.nb]
d = xb.shape[1]

if args.pcadim != -1:
    print("training PCA: %d -> %d" % (d, args.pcadim))
    pca = faiss.PCAMatrix(d, args.pcadim)
    pca.train(sanitize(xt_pca))
    xt = pca.apply_py(sanitize(xt))
    xb = pca.apply_py(sanitize(xb))
    d = xb.shape[1]
######################################################
# Run clustering
######################################################
index = faiss.IndexFlatL2(d)

if ngpu > 0:
    print("moving index to GPU")
    index = faiss.index_cpu_to_all_gpus(index)

clustering = faiss.Clustering(d, args.k)
clustering.verbose = True
clustering.seed = args.seed
clustering.max_points_per_centroid = 10**6
clustering.min_points_per_centroid = 1

centroids = None

for iter0 in range(0, args.niter, args.eval_freq):
    iter1 = min(args.niter, iter0 + args.eval_freq)
    clustering.niter = iter1 - iter0

    # warm-start from the previous chunk of iterations
    if iter0 > 0:
        faiss.copy_array_to_vector(centroids.ravel(), clustering.centroids)

    clustering.train(sanitize(xt), index)

    index.reset()
    centroids = faiss.vector_to_array(clustering.centroids).reshape(args.k, d)
    index.add(centroids)

    _, I = index.search(sanitize(xb), 1)
    error = ((xb - centroids[I.ravel()]) ** 2).sum()
    print("iter1=%d quantization error on test: %.4f" % (iter1, error))
# Copyright (c) Facebook, Inc. and its affiliates.
#
# This source code is licensed under the MIT license found in the
# LICENSE file in the root directory of this source tree.
import time
import sys
import os
import argparse
import numpy as np
def eval_recalls(name, I, gt, times):
    k = I.shape[1]
    s = "%-40s recall" % name
    nq = len(gt)
    for rank in 1, 10, 100, 1000:
        if rank > k:
            break
        recall = (I[:, :rank] == gt[:, :1]).sum() / nq
        s += "@%d: %.4f " % (rank, recall)
    s += "time: %.4f s (± %.4f)" % (np.mean(times), np.std(times))
    print(s)


def eval_inters(name, I, gt, times):
    k = I.shape[1]
    s = "%-40s inter" % name
    nq = len(gt)
    for rank in 1, 10, 100, 1000:
        if rank > k:
            break
        ninter = 0
        for i in range(nq):
            ninter += np.intersect1d(I[i, :rank], gt[i, :rank]).size
        inter = ninter / (nq * rank)
        s += "@%d: %.4f " % (rank, inter)
    s += "time: %.4f s (± %.4f)" % (np.mean(times), np.std(times))
    print(s)
def main():
    parser = argparse.ArgumentParser()

    def aa(*args, **kwargs):
        group.add_argument(*args, **kwargs)

    group = parser.add_argument_group('dataset options')
    aa('--db', default='deep1M', help='dataset')
    aa('--measure', default="1-recall",
       help="perf measure to use: 1-recall or inter")
    aa('--download', default=False, action="store_true")
    aa('--lib', default='faiss', help='library to use (faiss or scann)')
    aa('--thenscann', default=False, action="store_true")
    aa('--base_dir', default='/checkpoint/matthijs/faiss_improvements/cmp_ivf_scan_2')

    group = parser.add_argument_group('searching')
    aa('--k', default=10, type=int, help='nb of nearest neighbors')
    aa('--pre_reorder_k', default="0,10,100,1000", help='values for reorder_k')
    aa('--nprobe', default="1,2,5,10,20,50,100,200", help='values for nprobe')
    aa('--nrun', default=5, type=int, help='nb of runs to perform')

    args = parser.parse_args()
    print("args:", args)

    pre_reorder_k_tab = [int(x) for x in args.pre_reorder_k.split(',')]
    nprobe_tab = [int(x) for x in args.nprobe.split(',')]

    os.system('echo -n "nb processors "; '
              'cat /proc/cpuinfo | grep ^processor | wc -l; '
              'cat /proc/cpuinfo | grep ^"model name" | tail -1')

    cache_dir = args.base_dir + "/" + args.db + "/"
    k = args.k
    nrun = args.nrun

    if args.lib == "faiss":
        # prepare cache
        import faiss
        from datasets import load_dataset
        ds = load_dataset(args.db, download=args.download)
        print(ds)

        if not os.path.exists(cache_dir + "xb.npy"):
            # store for SCANN
            os.system(f"rm -rf {cache_dir}; mkdir -p {cache_dir}")
            tosave = dict(
                # xt = ds.get_train(10),
                xb = ds.get_database(),
                xq = ds.get_queries(),
                gt = ds.get_groundtruth()
            )
            for name, v in tosave.items():
                fname = cache_dir + "/" + name + ".npy"
                print("save", fname)
                np.save(fname, v)
            open(cache_dir + "metric", "w").write(ds.metric)

        name1_to_metric = {
            "IP": faiss.METRIC_INNER_PRODUCT,
            "L2": faiss.METRIC_L2
        }

        index_fname = cache_dir + "index.faiss"
        if not os.path.exists(index_fname):
            index = faiss_make_index(
                ds.get_database(), name1_to_metric[ds.metric], index_fname)
        else:
            index = faiss.read_index(index_fname)

        xb = ds.get_database()
        xq = ds.get_queries()
        gt = ds.get_groundtruth()

        faiss_eval_search(
            index, xq, xb, nprobe_tab, pre_reorder_k_tab, k, gt,
            nrun, args.measure
        )

    if args.lib == "scann":
        from scann.scann_ops.py import scann_ops_pybind
        dataset = {}
        for kn in "xb xq gt".split():
            fname = cache_dir + "/" + kn + ".npy"
            print("load", fname)
            dataset[kn] = np.load(fname)

        name1_to_name2 = {
            "IP": "dot_product",
            "L2": "squared_l2"
        }
        distance_measure = name1_to_name2[open(cache_dir + "metric").read()]

        xb = dataset["xb"]
        xq = dataset["xq"]
        gt = dataset["gt"]

        scann_dir = cache_dir + "/scann1.1.1_serialized"
        if os.path.exists(scann_dir + "/scann_config.pb"):
            searcher = scann_ops_pybind.load_searcher(scann_dir)
        else:
            searcher = scann_make_index(xb, distance_measure, scann_dir, 0)

        scann_dir = cache_dir + "/scann1.1.1_serialized_reorder"
        if os.path.exists(scann_dir + "/scann_config.pb"):
            searcher_reo = scann_ops_pybind.load_searcher(scann_dir)
        else:
            searcher_reo = scann_make_index(xb, distance_measure, scann_dir, 100)

        scann_eval_search(
            searcher, searcher_reo,
            xq, xb, nprobe_tab, pre_reorder_k_tab, k, gt,
            nrun, args.measure
        )

    if args.lib != "scann" and args.thenscann:
        # just append --lib scann, that will override the previous cmdline
        # options
        cmdline = " ".join(sys.argv) + " --lib scann"
        cmdline = (
            ". ~/anaconda3/etc/profile.d/conda.sh ; " +
            "conda activate scann_1.1.1; "
            "python -u " + cmdline)
        print("running", cmdline)
        os.system(cmdline)
###############################################################
# SCANN
###############################################################
def scann_make_index(xb, distance_measure, scann_dir, reorder_k):
import scann
print("build index")
if distance_measure == "dot_product":
thr = 0.2
else:
thr = 0
k = 10
sb = scann.scann_ops_pybind.builder(xb, k, distance_measure)
sb = sb.tree(num_leaves=2000, num_leaves_to_search=100, training_sample_size=250000)
sb = sb.score_ah(2, anisotropic_quantization_threshold=thr)
if reorder_k > 0:
sb = sb.reorder(reorder_k)
searcher = sb.build()
print("done")
print("write index to", scann_dir)
os.system(f"rm -rf {scann_dir}; mkdir -p {scann_dir}")
# os.mkdir(scann_dir)
searcher.serialize(scann_dir)
return searcher
def scann_eval_search(
searcher, searcher_reo,
xq, xb, nprobe_tab, pre_reorder_k_tab, k, gt,
nrun, measure):
# warmup
for _run in range(5):
searcher.search_batched(xq)
for nprobe in nprobe_tab:
for pre_reorder_k in pre_reorder_k_tab:
times = []
for _run in range(nrun):
if pre_reorder_k == 0:
t0 = time.time()
I, D = searcher.search_batched(
xq, leaves_to_search=nprobe, final_num_neighbors=k
)
t1 = time.time()
else:
t0 = time.time()
I, D = searcher_reo.search_batched(
xq, leaves_to_search=nprobe, final_num_neighbors=k,
pre_reorder_num_neighbors=pre_reorder_k
)
t1 = time.time()
times.append(t1 - t0)
header = "SCANN nprobe=%4d reo=%4d" % (nprobe, pre_reorder_k)
if measure == "1-recall":
eval_recalls(header, I, gt, times)
else:
eval_inters(header, I, gt, times)
###############################################################
# Faiss
###############################################################
def faiss_make_index(xb, metric_type, fname):
import faiss
d = xb.shape[1]
M = d // 2
index = faiss.index_factory(d, f"IVF2000,PQ{M}x4fs", metric_type)
# if not by_residual:
# print("setting no residual")
# index.by_residual = False
print("train")
# index.train(ds.get_train())
index.train(xb[:250000])
print("add")
index.add(xb)
print("write index", fname)
faiss.write_index(index, fname)
return index
def faiss_eval_search(
index, xq, xb, nprobe_tab, pre_reorder_k_tab,
k, gt, nrun, measure
):
import faiss
print("use precomputed table=", index.use_precomputed_table,
"by residual=", index.by_residual)
print("adding a refine index")
index_refine = faiss.IndexRefineFlat(index, faiss.swig_ptr(xb))
print("set single thread")
faiss.omp_set_num_threads(1)
print("warmup")
for _run in range(5):
index.search(xq, k)
print("run timing")
for nprobe in nprobe_tab:
for pre_reorder_k in pre_reorder_k_tab:
index.nprobe = nprobe
times = []
for _run in range(nrun):
if pre_reorder_k == 0:
t0 = time.time()
D, I = index.search(xq, k)
t1 = time.time()
else:
index_refine.k_factor = pre_reorder_k / k
t0 = time.time()
D, I = index_refine.search(xq, k)
t1 = time.time()
times.append(t1 - t0)
header = "Faiss nprobe=%4d reo=%4d" % (nprobe, pre_reorder_k)
if measure == "1-recall":
eval_recalls(header, I, gt, times)
else:
eval_inters(header, I, gt, times)
if __name__ == "__main__":
main()
# Copyright (c) Facebook, Inc. and its affiliates.
#
# This source code is licensed under the MIT license found in the
# LICENSE file in the root directory of this source tree.
"""
Common functions to load datasets and compute their ground-truth
"""
import time
import numpy as np
import faiss
from faiss.contrib import datasets as faiss_datasets
print("path:", faiss_datasets.__file__)
faiss_datasets.dataset_basedir = '/checkpoint/matthijs/simsearch/'
def sanitize(x):
return np.ascontiguousarray(x, dtype='float32')
#################################################################
# Dataset
#################################################################
class DatasetCentroids(faiss_datasets.Dataset):
def __init__(self, ds, indexfile):
self.d = ds.d
self.metric = ds.metric
self.nq = ds.nq
self.xq = ds.get_queries()
# get the xb set
src_index = faiss.read_index(indexfile)
src_quant = faiss.downcast_index(src_index.quantizer)
centroids = faiss.vector_to_array(src_quant.xb)
self.xb = centroids.reshape(-1, self.d)
self.nb = self.nt = len(self.xb)
def get_queries(self):
return self.xq
def get_database(self):
return self.xb
def get_train(self, maxtrain=None):
return self.xb
def get_groundtruth(self, k=100):
return faiss.knn(
self.xq, self.xb, k,
faiss.METRIC_L2 if self.metric == 'L2' else faiss.METRIC_INNER_PRODUCT
)[1]
def load_dataset(dataset='deep1M', compute_gt=False, download=False):
print("load data", dataset)
if dataset == 'sift1M':
return faiss_datasets.DatasetSIFT1M()
elif dataset.startswith('bigann'):
dbsize = 1000 if dataset == "bigann1B" else int(dataset[6:-1])
return faiss_datasets.DatasetBigANN(nb_M=dbsize)
elif dataset.startswith("deep_centroids_"):
ncent = int(dataset[len("deep_centroids_"):])
centdir = "/checkpoint/matthijs/bench_all_ivf/precomputed_clusters"
return DatasetCentroids(
faiss_datasets.DatasetDeep1B(nb=1000000),
f"{centdir}/clustering.dbdeep1M.IVF{ncent}.faissindex"
)
elif dataset.startswith("deep"):
szsuf = dataset[4:]
if szsuf[-1] == 'M':
dbsize = 10 ** 6 * int(szsuf[:-1])
elif szsuf == '1B':
dbsize = 10 ** 9
elif szsuf[-1] == 'k':
dbsize = 1000 * int(szsuf[:-1])
else:
assert False, "did not recognize suffix " + szsuf
return faiss_datasets.DatasetDeep1B(nb=dbsize)
elif dataset == "music-100":
return faiss_datasets.DatasetMusic100()
elif dataset == "glove":
return faiss_datasets.DatasetGlove(download=download)
else:
assert False
#################################################################
# Evaluation
#################################################################
def evaluate_DI(D, I, gt):
nq = gt.shape[0]
k = I.shape[1]
rank = 1
while rank <= k:
recall = (I[:, :rank] == gt[:, :1]).sum() / float(nq)
print("R@%d: %.4f" % (rank, recall), end=' ')
rank *= 10
def evaluate(xq, gt, index, k=100, endl=True):
t0 = time.time()
D, I = index.search(xq, k)
t1 = time.time()
nq = xq.shape[0]
print("\t %8.4f ms per query, " % (
(t1 - t0) * 1000.0 / nq), end=' ')
rank = 1
while rank <= k:
recall = (I[:, :rank] == gt[:, :1]).sum() / float(nq)
print("R@%d: %.4f" % (rank, recall), end=' ')
rank *= 10
if endl:
print()
return D, I
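The recall@rank computation used by `evaluate()` and `evaluate_DI()` above can be exercised standalone on synthetic data; this is a minimal sketch (the helper name `recall_at` and the toy arrays are illustrative, not part of the benchmark code):

```python
import numpy as np

def recall_at(I, gt, rank):
    # fraction of queries whose true nearest neighbor (gt[:, 0]) appears
    # among the first `rank` returned ids, as in evaluate() above
    nq = gt.shape[0]
    return (I[:, :rank] == gt[:, :1]).sum() / float(nq)

I = np.array([[3, 1, 2], [5, 4, 0]])   # returned ids per query
gt = np.array([[1], [5]])              # true nearest neighbor per query
# query 0 finds its ground truth at rank 2, query 1 at rank 1
assert recall_at(I, gt, 1) == 0.5
assert recall_at(I, gt, 3) == 1.0
```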
# Copyright (c) Facebook, Inc. and its affiliates.
#
# This source code is licensed under the MIT license found in the
# LICENSE file in the root directory of this source tree.
import logging
# https://stackoverflow.com/questions/7016056/python-logging-not-outputting-anything
logging.basicConfig()
logger = logging.getLogger('faiss.contrib.exhaustive_search')
logger.setLevel(logging.INFO)
from faiss.contrib import datasets
from faiss.contrib.exhaustive_search import knn_ground_truth
from faiss.contrib import vecs_io
ds = datasets.DatasetDeep1B(nb=int(1e9))
print("computing GT matches for", ds)
D, I = knn_ground_truth(
ds.get_queries(),
ds.database_iterator(bs=65536),
k=100
)
vecs_io.ivecs_write("/tmp/tt.ivecs", I)
# Copyright (c) Facebook, Inc. and its affiliates.
#
# This source code is licensed under the MIT license found in the
# LICENSE file in the root directory of this source tree.
import os
import numpy as np
from collections import defaultdict
from matplotlib import pyplot
import re
from argparse import Namespace
from faiss.contrib.factory_tools import get_code_size as unitsize
def dbsize_from_name(dbname):
sufs = {
'1B': 10**9,
'100M': 10**8,
'10M': 10**7,
'1M': 10**6,
}
for s in sufs:
if dbname.endswith(s):
return sufs[s]
else:
assert False
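For reference, the suffix parsing above behaves as follows (the function is re-stated here so the snippet runs on its own; behavior matches the definition above under the assumption that no database name ends in an ambiguous suffix):

```python
def dbsize_from_name(dbname):
    # map a trailing size suffix of the database name to a vector count
    sufs = {'1B': 10**9, '100M': 10**8, '10M': 10**7, '1M': 10**6}
    for s in sufs:
        if dbname.endswith(s):
            return sufs[s]
    assert False, "unknown size suffix in " + dbname

assert dbsize_from_name('deep10M') == 10**7
assert dbsize_from_name('bigann1B') == 10**9
```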
def keep_latest_stdout(fnames):
fnames = [fname for fname in fnames if fname.endswith('.stdout')]
fnames.sort()
n = len(fnames)
fnames2 = []
for i, fname in enumerate(fnames):
if i + 1 < n and fnames[i + 1][:-8] == fname[:-8]:
continue
fnames2.append(fname)
return fnames2
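Since log files are named `<key>.<version>.stdout` with a one-letter version, sorting groups versions of the same key together and the dedup above keeps only the last (latest) one per key. A compact standalone equivalent, re-stated so it can be checked in isolation:

```python
def keep_latest_stdout(fnames):
    # keep only .stdout files, and among files that share the same
    # <key> prefix (everything before the version letter + ".stdout",
    # i.e. fname[:-8]), keep the lexicographically last version
    fnames = sorted(f for f in fnames if f.endswith('.stdout'))
    return [f for i, f in enumerate(fnames)
            if i + 1 == len(fnames) or fnames[i + 1][:-8] != f[:-8]]

assert keep_latest_stdout(
    ['run.a.stdout', 'run.b.stdout', 'other.a.stdout', 'notes.txt']
) == ['other.a.stdout', 'run.b.stdout']
```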
def parse_result_file(fname):
# print fname
st = 0
res = []
keys = []
stats = {}
stats['run_version'] = fname[-8]
indexkey = None
for l in open(fname):
if l.startswith("srun:"):
# looks like a crash...
if indexkey is None:
raise RuntimeError("instant crash")
break
elif st == 0:
if l.startswith("dataset in dimension"):
fi = l.split()
stats["d"] = int(fi[3][:-1])
stats["nq"] = int(fi[9])
stats["nb"] = int(fi[11])
stats["nt"] = int(fi[13])
if l.startswith('index size on disk:'):
stats['index_size'] = int(l.split()[-1])
if l.startswith('current RSS:'):
stats['RSS'] = int(l.split()[-1])
if l.startswith('precomputed tables size:'):
stats['tables_size'] = int(l.split()[-1])
if l.startswith('Setting nb of threads to'):
stats['n_threads'] = int(l.split()[-1])
if l.startswith(' add in'):
stats['add_time'] = float(l.split()[-2])
if l.startswith("vector code_size"):
stats['code_size'] = float(l.split()[-1])
if l.startswith('args:'):
args = eval(l[l.find(' '):])
indexkey = args.indexkey
elif "time(ms/q)" in l:
# result header
if 'R@1 R@10 R@100' in l:
stats["measure"] = "recall"
stats["ranks"] = [1, 10, 100]
elif 'I@1 I@10 I@100' in l:
stats["measure"] = "inter"
stats["ranks"] = [1, 10, 100]
elif 'inter@' in l:
stats["measure"] = "inter"
fi = l.split()
if fi[1] == "inter@":
rank = int(fi[2])
else:
rank = int(fi[1][len("inter@"):])
stats["ranks"] = [rank]
else:
assert False
st = 1
elif 'index size on disk:' in l:
stats["index_size"] = int(l.split()[-1])
elif st == 1:
st = 2
elif st == 2:
fi = l.split()
if l[0] == " ":
# means there are 0 parameters
fi = [""] + fi
keys.append(fi[0])
res.append([float(x) for x in fi[1:]])
return indexkey, np.array(res), keys, stats
# the directory used in run_on_cluster.bash
basedir = "/checkpoint/matthijs/bench_all_ivf/"
logdir = basedir + 'logs/'
def collect_results_for(db='deep1M', prefix="autotune."):
# run parsing
allres = {}
allstats = {}
missing = []
fnames = keep_latest_stdout(os.listdir(logdir))
# print fnames
# filenames are in the form <key>.x.stdout
# where x is a version number (from a to z)
# keep only latest version of each name
for fname in fnames:
if not (
'db' + db in fname and
fname.startswith(prefix) and
fname.endswith('.stdout')
):
continue
print("parse", fname, end=" ", flush=True)
try:
indexkey, res, _, stats = parse_result_file(logdir + fname)
except RuntimeError as e:
print("FAIL %s" % e)
res = np.zeros((2, 0))
except Exception as e:
print("PARSE ERROR", e)
res = np.zeros((2, 0))
else:
print(len(res), "results")
if res.size == 0:
missing.append(fname)
else:
if indexkey in allres:
if allstats[indexkey]['run_version'] > stats['run_version']:
# don't use this run
continue
allres[indexkey] = res
allstats[indexkey] = stats
return allres, allstats
def extract_pareto_optimal(allres, keys, recall_idx=0, times_idx=3):
bigtab = []
for i, k in enumerate(keys):
v = allres[k]
perf = v[:, recall_idx]
times = v[:, times_idx]
bigtab.append(
np.vstack((
np.ones(times.size) * i,
perf, times
))
)
if bigtab == []:
return [], np.zeros((3, 0))
bigtab = np.hstack(bigtab)
# sort by perf
perm = np.argsort(bigtab[1, :])
bigtab_sorted = bigtab[:, perm]
best_times = np.minimum.accumulate(bigtab_sorted[2, ::-1])[::-1]
selection, = np.where(bigtab_sorted[2, :] == best_times)
selected_keys = [
keys[i] for i in
np.unique(bigtab_sorted[0, selection].astype(int))
]
ops = bigtab_sorted[:, selection]
return selected_keys, ops
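The Pareto selection above works by sorting operating points by increasing accuracy, then taking a reverse running minimum of the times: a point survives iff no point with accuracy at least as high is faster. A toy illustration of that core step (the arrays here are made up for the example):

```python
import numpy as np

perf  = np.array([0.5, 0.6, 0.7, 0.9])   # accuracy of each operating point
times = np.array([1.0, 5.0, 2.0, 3.0])   # time of each operating point
order = np.argsort(perf)
# best achievable time among points with accuracy >= perf[i]
best_times = np.minimum.accumulate(times[order][::-1])[::-1]
selection, = np.where(times[order] == best_times)
# (0.6, 5.0) is dominated by (0.7, 2.0): higher accuracy AND faster
assert list(selection) == [0, 2, 3]
```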
def plot_subset(
allres, allstats, selected_methods, recall_idx, times_idx=3,
report=["overhead", "build time"]):
# important methods
for k in selected_methods:
v = allres[k]
stats = allstats[k]
d = stats["d"]
dbsize = stats["nb"]
if "index_size" in stats and "tables_size" in stats:
tot_size = stats['index_size'] + stats['tables_size']
else:
tot_size = -1
id_size = 8 # 64 bit
addt = ''
if 'add_time' in stats:
add_time = stats['add_time']
if add_time > 7200:
add_min = add_time / 60
addt = ', %dh%02d' % (add_min / 60, add_min % 60)
else:
add_sec = int(add_time)
addt = ', %dm%02d' % (add_sec / 60, add_sec % 60)
code_size = unitsize(d, k)
label = k
if "code_size" in report:
label += " %d bytes" % code_size
tight_size = (code_size + id_size) * dbsize
if tot_size < 0 or "overhead" not in report:
pass # don't know what the index size is
elif tot_size > 10 * tight_size:
label += " overhead x%.1f" % (tot_size / tight_size)
else:
label += " overhead+%.1f%%" % (
tot_size / tight_size * 100 - 100)
if "build time" in report:
label += " " + addt
linestyle = (':' if 'Refine' in k or 'RFlat' in k else
'-.' if 'SQ' in k else
'-' if '4fs' in k else
'-')
print(k, linestyle)
pyplot.semilogy(v[:, recall_idx], 1000 / v[:, times_idx], label=label,
linestyle=linestyle,
marker='o' if '4fs' in k else '+')
recall_rank = stats["ranks"][recall_idx]
if stats["measure"] == "recall":
pyplot.xlabel('1-recall at %d' % recall_rank)
elif stats["measure"] == "inter":
pyplot.xlabel('inter @ %d' % recall_rank)
else:
assert False
pyplot.ylabel('QPS (%d threads)' % stats["n_threads"])
def plot_tradeoffs(db, allres, allstats, code_size, recall_rank):
stat0 = next(iter(allstats.values()))
d = stat0["d"]
n_threads = stat0["n_threads"]
recall_idx = stat0["ranks"].index(recall_rank)
# times come after the perf measure
times_idx = len(stat0["ranks"])
if type(code_size) == int:
if code_size == 0:
code_size = [0, 1e50]
code_size_name = "any code size"
else:
code_size_name = "code_size=%d" % code_size
code_size = [code_size, code_size]
elif type(code_size) == tuple:
code_size_name = "code_size in [%d, %d]" % code_size
else:
assert False
names_maxperf = []
for k in sorted(allres):
v = allres[k]
if v.ndim != 2: continue
us = unitsize(d, k)
if not code_size[0] <= us <= code_size[1]: continue
names_maxperf.append((v[-1, recall_idx], k))
# sort from lowest to highest topline accuracy
names_maxperf.sort()
names = [name for mp, name in names_maxperf]
selected_methods, optimal_points = \
extract_pareto_optimal(allres, names, recall_idx, times_idx)
not_selected = list(set(names) - set(selected_methods))
print("methods without an optimal OP: ", not_selected)
pyplot.title('database ' + db + ' ' + code_size_name)
# grayed out lines
for k in not_selected:
v = allres[k]
if v.ndim != 2: continue
us = unitsize(d, k)
if not code_size[0] <= us <= code_size[1]: continue
linestyle = (':' if 'PQ' in k else
'-.' if 'SQ4' in k else
'--' if 'SQ8' in k else '-')
pyplot.semilogy(v[:, recall_idx], 1000 / v[:, times_idx], label=None,
linestyle=linestyle,
marker='o' if 'HNSW' in k else '+',
color='#cccccc', linewidth=0.2)
plot_subset(allres, allstats, selected_methods, recall_idx, times_idx)
if len(not_selected) == 0:
om = ''
else:
om = '\nomitted:'
nc = len(om)
for m in not_selected:
if nc > 80:
om += '\n'
nc = 0
om += ' ' + m
nc += len(m) + 1
# pyplot.semilogy(optimal_points[1, :], optimal_points[2, :], marker="s")
# print(optimal_points[0, :])
pyplot.xlabel('1-recall at %d %s' % (recall_rank, om) )
pyplot.ylabel('QPS (%d threads)' % n_threads)
pyplot.legend()
pyplot.grid()
return selected_methods, not_selected
if __name__ == "__main__xx":
# tests on centroids indexing (v1)
for k in 1, 32, 128:
pyplot.gcf().set_size_inches(15, 10)
i = 1
for ncent in 65536, 262144, 1048576, 4194304:
db = f'deep_centroids_{ncent}.k{k}.'
allres, allstats = collect_results_for(
db=db, prefix="cent_index.")
pyplot.subplot(2, 2, i)
plot_subset(
allres, allstats, list(allres.keys()),
recall_idx=0,
times_idx=1,
report=["code_size"]
)
i += 1
pyplot.title(f"{ncent} centroids")
pyplot.legend()
pyplot.xlim([0.95, 1])
pyplot.grid()
pyplot.savefig('figs/deep1B_centroids_k%d.png' % k)
if __name__ == "__main__xx":
# centroids plot per k
pyplot.gcf().set_size_inches(15, 10)
i = 1
for ncent in 65536, 262144, 1048576, 4194304:
xyd = defaultdict(list)
for k in 1, 4, 8, 16, 32, 64, 128, 256:
db = f'deep_centroids_{ncent}.k{k}.'
allres, allstats = collect_results_for(db=db, prefix="cent_index.")
for indexkey, res in allres.items():
idx, = np.where(res[:, 0] >= 0.99)
if idx.size > 0:
xyd[indexkey].append((k, 1000 / res[idx[0], 1]))
pyplot.subplot(2, 2, i)
i += 1
for indexkey, xy in xyd.items():
xy = np.array(xy)
pyplot.loglog(xy[:, 0], xy[:, 1], 'o-', label=indexkey)
pyplot.title(f"{ncent} centroids")
pyplot.xlabel("k")
xt = 2**np.arange(9)
pyplot.xticks(xt, ["%d" % x for x in xt])
pyplot.ylabel("QPS (32 threads)")
pyplot.legend()
pyplot.grid()
pyplot.savefig('../plots/deep1B_centroids_min99.png')
if __name__ == "__main__xx":
# main indexing plots
i = 0
for db in 'bigann10M', 'deep10M', 'bigann100M', 'deep100M', 'deep1B', 'bigann1B':
allres, allstats = collect_results_for(
db=db, prefix="autotune.")
for cs in 8, 16, 32, 64:
pyplot.figure(i)
i += 1
pyplot.gcf().set_size_inches(15, 10)
cs_range = (
(0, 8) if cs == 8 else (cs // 2 + 1, cs)
)
plot_tradeoffs(
db, allres, allstats, code_size=cs_range, recall_rank=1)
pyplot.savefig('../plots/tradeoffs_%s_cs%d_r1.png' % (
db, cs))
if __name__ == "__main__":
# 1M indexes
i = 0
for db in "glove", "music-100":
pyplot.figure(i)
pyplot.gcf().set_size_inches(15, 10)
i += 1
allres, allstats = collect_results_for(db=db, prefix="autotune.")
plot_tradeoffs(db, allres, allstats, code_size=0, recall_rank=1)
pyplot.savefig('../plots/1M_tradeoffs_' + db + ".png")
for db in "sift1M", "deep1M":
allres, allstats = collect_results_for(db=db, prefix="autotune.")
pyplot.figure(i)
pyplot.gcf().set_size_inches(15, 10)
i += 1
plot_tradeoffs(db, allres, allstats, code_size=(0, 64), recall_rank=1)
pyplot.savefig('../plots/1M_tradeoffs_' + db + "_small.png")
pyplot.figure(i)
pyplot.gcf().set_size_inches(15, 10)
i += 1
plot_tradeoffs(db, allres, allstats, code_size=(65, 10000), recall_rank=1)
pyplot.savefig('../plots/1M_tradeoffs_' + db + "_large.png")
if __name__ == "__main__xx":
db = 'sift1M'
allres, allstats = collect_results_for(db=db, prefix="autotune.")
pyplot.gcf().set_size_inches(15, 10)
keys = [
"IVF1024,PQ32x8",
"IVF1024,PQ64x4",
"IVF1024,PQ64x4fs",
"IVF1024,PQ64x4fsr",
"IVF1024,SQ4",
"IVF1024,SQ8"
]
plot_subset(allres, allstats, keys, recall_idx=0, report=["code_size"])
pyplot.legend()
pyplot.title(db)
pyplot.xlabel("1-recall@1")
pyplot.ylabel("QPS (32 threads)")
pyplot.grid()
pyplot.savefig('../plots/ivf1024_variants.png')
pyplot.figure(2)
pyplot.gcf().set_size_inches(15, 10)
keys = [
"HNSW32",
"IVF1024,PQ64x4fs",
"IVF1024,PQ64x4fsr",
"IVF1024,PQ64x4fs,RFlat",
"IVF1024,PQ64x4fs,Refine(SQfp16)",
"IVF1024,PQ64x4fs,Refine(SQ8)",
]
plot_subset(allres, allstats, keys, recall_idx=0, report=["code_size"])
pyplot.legend()
pyplot.title(db)
pyplot.xlabel("1-recall@1")
pyplot.ylabel("QPS (32 threads)")
pyplot.grid()
pyplot.savefig('../plots/ivf1024_rerank.png')