From scikit-learn to Faiss: Migrating PCA for Scalable Vector Search

Why use Faiss?

Faiss is a high‑performance library for vector similarity search and related primitives (clustering, compression, and linear transforms such as PCA). It scales to millions or even billions of vectors on CPU and GPU, and its PCA transform is considerably faster than scikit-learn's. In practice this reduces memory, latency, and Python overhead.

Why migrate PCA to Faiss?

If you’re already using scikit-learn for training, why switch to Faiss for deployment?

  • Training PCA in sklearn is convenient, but its transform() implementation is slow at deployment scale.
  • Faiss offers faster, more efficient kernels for applying PCA at scale.
  • You can migrate a trained sklearn.PCA to a faiss.PCAMatrix without retraining.

Principal Component Analysis

PCA (Principal Component Analysis) is a linear dimensionality reduction technique. It projects data into a lower-dimensional space using the eigenvectors of the covariance matrix. You can check this video for a detailed explanation.
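
As a quick illustration, here is a minimal NumPy sketch of that idea (sklearn actually computes the decomposition via SVD rather than an explicit covariance eigendecomposition, but the resulting projection is the same):

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 8))

# Center the data and compute the covariance matrix
Xc = X - X.mean(axis=0)
cov = np.cov(Xc, rowvar=False)

# Eigenvectors of the covariance, sorted by decreasing eigenvalue
eigvals, eigvecs = np.linalg.eigh(cov)
order = np.argsort(eigvals)[::-1]
components = eigvecs[:, order[:2]].T  # keep the top 2 components

# Project the centered data into the 2-dimensional PCA space
X_2d = Xc @ components.T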

Sklearn

We’ll focus on the essential operation of PCA: projecting vectors using transform().

Given:

  • X as the input data
  • skl_pca as the trained PCA object from sklearn

You can project X into the PCA-transformed space like this (note that sklearn first centers the data with the fitted mean):

X_transformed = (X - skl_pca.mean_) @ skl_pca.components_.T

If whitening was applied during PCA fitting, you’ll also need to scale the output:

scale = np.sqrt(skl_pca.explained_variance_)
min_scale = np.finfo(scale.dtype).eps
scale[scale < min_scale] = min_scale  # avoid division by zero for near-zero variances
X_transformed /= scale

For reference, see the official implementation.

Faiss

In Faiss, after training a PCAMatrix, the transformation looks slightly different:

X_transformed = X @ faiss_pca.A.T + faiss_pca.b

Here, A is the components matrix and b is a bias vector. (In the Python API both are exposed as flat float32 vectors, which is why the migration code below reshapes and copies them explicitly.)
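
If you want to inspect those parameters, faiss.vector_to_array converts the flat Faiss vectors back to NumPy. A minimal sketch with arbitrary dimensions:

import faiss
import numpy as np

d_in, d_out = 64, 16
X = np.random.randn(10_000, d_in).astype(np.float32)

faiss_pca = faiss.PCAMatrix(d_in, d_out)
faiss_pca.train(X)

# A is stored flat in row-major (d_out, d_in) order; b has d_out entries
A = faiss.vector_to_array(faiss_pca.A).reshape(d_out, d_in)
b = faiss.vector_to_array(faiss_pca.b)

# apply_py computes exactly X @ A.T + b
np.testing.assert_allclose(faiss_pca.apply_py(X), X @ A.T + b, atol=1e-4)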

Migrating from sklearn to Faiss

To migrate from a trained sklearn.PCA model to a faiss.PCAMatrix, you need to extract:

  • A: the transformed components matrix
  • b: the bias vector to match sklearn’s behavior

Depending on whether whitening is used:

Without whitening:

A = skl_pca.components_
b = -skl_pca.mean_ @ A.T

With whiten=True:

# scale each row of skl_pca.components_
A = skl_pca.components_ / np.sqrt(skl_pca.explained_variance_)[:, None]
b = -skl_pca.mean_ @ A.T

With these definitions, the following identity holds:

X @ A.T + b  ==  skl_pca.transform(X)
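
Before touching Faiss, you can sanity-check this identity in plain NumPy (a quick sketch for the whiten=False case, with skl_pca and X as above):

import numpy as np

A = skl_pca.components_
b = -skl_pca.mean_ @ A.T

# X @ A.T + b == (X - mean_) @ A.T, which is what transform() computes
np.testing.assert_allclose(X @ A.T + b, skl_pca.transform(X), atol=1e-6)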

Code

Let’s create a small PCA model using the USPS digits dataset:

from sklearn.datasets import fetch_openml
from sklearn.model_selection import train_test_split
from sklearn.decomposition import PCA

X, y = fetch_openml(data_id=41082, as_frame=False, return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=42, test_size=3_000)

skl_pca = PCA(n_components=32, random_state=42)
skl_pca.fit(X_train)

Now, let’s migrate the trained model to Faiss:

import faiss
import numpy as np


def sklearn_pca_to_faiss(skl_pca) -> faiss.PCAMatrix:
    d_in = skl_pca.components_.shape[1]
    d_out = skl_pca.n_components_

    # Build A: rows are components; include whitening if requested
    if getattr(skl_pca, "whiten", False):
        scale = np.sqrt(skl_pca.explained_variance_)[:, None]
        A = (skl_pca.components_ / scale).astype(np.float32)
    else:
        A = skl_pca.components_.astype(np.float32)

    faiss_pca = faiss.PCAMatrix(d_in, d_out, 0.0, False)  # eigen_power handled manually
    faiss.copy_array_to_vector(A.reshape(-1), faiss_pca.A)

    mean = skl_pca.mean_.astype(np.float32)
    faiss.copy_array_to_vector(mean.reshape(-1), faiss_pca.mean)

    # Choose bias so that X @ A^T + b == (X - mean) @ A^T
    b = -mean @ A.T  # shape (d_out,)
    faiss.copy_array_to_vector(b.reshape(-1), faiss_pca.b)

    faiss_pca.is_trained = True
    return faiss_pca

faiss_pca = sklearn_pca_to_faiss(skl_pca)

Important: Use Faiss’s copy_array_to_vector utility to load arrays into Faiss structures. See this file for implementation details.
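
As a quick check that the arrays landed where we expect, vector_to_array is the inverse utility (a small sketch, continuing with the whiten=False model trained above):

A_back = faiss.vector_to_array(faiss_pca.A).reshape(faiss_pca.d_out, faiss_pca.d_in)
b_back = faiss.vector_to_array(faiss_pca.b)

# Compare against the original sklearn parameters
np.testing.assert_allclose(A_back, skl_pca.components_, atol=1e-6)
np.testing.assert_allclose(b_back, -skl_pca.mean_ @ skl_pca.components_.T, atol=1e-5)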

Validation

Always validate that the migration preserves results:

import numpy as np
import time

X = np.random.randn(1_000_000, faiss_pca.d_in).astype(np.float32)
# Check over some random vectors
np.testing.assert_allclose(skl_pca.transform(X), faiss_pca.apply_py(X), atol=1e-5)
# Check over train vectors
np.testing.assert_allclose(skl_pca.transform(X_train), faiss_pca.apply_py(X_train), atol=1e-5)
# Check over test vectors
np.testing.assert_allclose(skl_pca.transform(X_test), faiss_pca.apply_py(X_test), atol=1e-5)
print("OK: sklearn == faiss")

# sklearn
t0 = time.perf_counter(); _ = skl_pca.transform(X); t1 = time.perf_counter()
# faiss
t2 = time.perf_counter(); _ = faiss_pca.apply_py(X); t3 = time.perf_counter()

print(f"sklearn.transform: {(t1-t0):.3f}s  | {(X.shape[0]/(t1-t0)):.0f} vec/s")
print(f"faiss.apply_py  : {(t3-t2):.3f}s  | {(X.shape[0]/(t3-t2)):.0f} vec/s")
print(f"Speedup: {((t1-t0)/(t3-t2)):.1f}x")

In this example, a 1.2x speedup was achieved. See the complete code here.
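
For deployment, the migrated transform can be serialized with Faiss's I/O helpers and reloaded in the serving process (a minimal sketch; the filename is arbitrary):

import faiss
import numpy as np

# Persist the migrated PCA for the serving side
faiss.write_VectorTransform(faiss_pca, "pca32.faiss")

# ... later, in the inference service
pca = faiss.read_VectorTransform("pca32.faiss")
X_reduced = pca.apply_py(X_test.astype(np.float32))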

Conclusion

Migrating from scikit-learn to Faiss for PCA application is a straightforward optimization with real-world impact. You can keep sklearn for training and validation, then deploy the exact same projection using Faiss, boosting inference performance without retraining.

This method is simple, deterministic, and production-ready. With just a few lines of code, you bridge the gap between experimentation and scalable deployment.