From scikit-learn to Faiss: Migrating PCA for Scalable Vector Search
Why use Faiss?
Faiss is a high-performance library for vector similarity search and related primitives (clustering, compression, and linear transforms such as PCA). It scales to millions or billions of vectors on CPU and GPU, and its kernels for applying PCA are considerably faster than scikit-learn's. In practice this reduces memory, latency, and Python overhead.
Why migrate PCA to Faiss?
If you’re already using scikit-learn for training, why switch to Faiss for deployment?
- Training PCA in sklearn is convenient, but applying it at inference time in sklearn is comparatively slow.
- Faiss offers faster, more efficient kernels for applying PCA at scale.
- You can migrate a trained sklearn.PCA to a faiss.PCAMatrix without retraining.
Principal Component Analysis
PCA (Principal Component Analysis) is a linear dimensionality reduction technique. It projects data onto a lower-dimensional space spanned by the top eigenvectors of the covariance matrix. You can check this video for a detailed explanation.
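To make this concrete, here is a minimal NumPy sketch of that idea on toy data (illustrative only; the shapes and data are made up, and real code should use a library implementation):
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 8))            # toy data: 500 samples, 8 features

Xc = X - X.mean(axis=0)                  # center the data
cov = Xc.T @ Xc / (len(Xc) - 1)          # sample covariance matrix
eigvals, eigvecs = np.linalg.eigh(cov)   # eigendecomposition (ascending order)
top2 = np.argsort(eigvals)[::-1][:2]     # indices of the two largest eigenvalues
components = eigvecs[:, top2].T          # principal axes, shape (2, 8)

X_reduced = Xc @ components.T            # data projected onto a 2-D subspace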
Sklearn
We’ll focus on the essential operation of PCA: projecting vectors using transform().
Given:
- X as the input data
- skl_pca as the trained PCA object from sklearn
You can project X into the PCA-transformed space like this (note that sklearn first subtracts the fitted mean):
X_transformed = (X - skl_pca.mean_) @ skl_pca.components_.T
If whitening was applied during fitting, you also need to divide each output dimension by the standard deviation of the corresponding component. The sklearn source does this with the array-API namespace xp; in plain NumPy it reads:
scale = np.sqrt(skl_pca.explained_variance_)
min_scale = np.finfo(scale.dtype).eps
scale[scale < min_scale] = min_scale  # clamp to avoid division by zero
X_transformed /= scale
For reference, see the official implementation.
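As a quick sanity check, the manual projection should match transform() on a fitted model; a minimal sketch with random data and whiten=False:
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 16))
skl_pca = PCA(n_components=4).fit(X)

manual = (X - skl_pca.mean_) @ skl_pca.components_.T
np.testing.assert_allclose(manual, skl_pca.transform(X), atol=1e-10)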
Faiss
In Faiss, after training a PCAMatrix, the transformation looks slightly different:
X_transformed = X @ faiss_pca.A.T + faiss_pca.b
Here, A is the components matrix and b is a bias vector.
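To see these fields concretely, you can train a PCAMatrix on random data and read A and b back with faiss.vector_to_array; a minimal sketch (toy dimensions, illustrative only):
import faiss
import numpy as np

d_in, d_out = 64, 8
X = np.random.default_rng(0).normal(size=(10_000, d_in)).astype(np.float32)

pca = faiss.PCAMatrix(d_in, d_out)
pca.train(X)

A = faiss.vector_to_array(pca.A).reshape(d_out, d_in)  # projection matrix
b = faiss.vector_to_array(pca.b)                       # bias vector, shape (d_out,)

# the transform is exactly the affine map above
np.testing.assert_allclose(pca.apply_py(X), X @ A.T + b, atol=1e-4)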
Migrating from sklearn to Faiss
To migrate from a trained sklearn.PCA model to a faiss.PCAMatrix, you need to extract:
- A: the transformed components matrix
- b: the bias vector to match sklearn’s behavior
Depending on whether whitening is used:
Without whitening:
A = skl_pca.components_
b = -skl_pca.mean_ @ A.T
With whiten=True:
# scale each row of skl_pca.components_
A = skl_pca.components_ / np.sqrt(skl_pca.explained_variance_)[:, None]
b = -skl_pca.mean_ @ A.T
Since sklearn computes (X - mean_) @ A.T, which expands to X @ A.T - (mean_ @ A.T), these definitions give:
X @ A.T + b == skl_pca.transform(X)
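You can verify the equivalence numerically for both branches; a quick sketch on random data:
import numpy as np
from sklearn.decomposition import PCA

X = np.random.default_rng(0).normal(size=(500, 16))

for whiten in (False, True):
    skl_pca = PCA(n_components=4, whiten=whiten).fit(X)
    A = skl_pca.components_
    if whiten:
        A = A / np.sqrt(skl_pca.explained_variance_)[:, None]
    b = -skl_pca.mean_ @ A.T
    np.testing.assert_allclose(X @ A.T + b, skl_pca.transform(X), atol=1e-10)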
Code
Let’s create a small PCA model using the USPS digits dataset:
from sklearn.datasets import fetch_openml
from sklearn.model_selection import train_test_split
from sklearn.decomposition import PCA
X, y = fetch_openml(data_id=41082, as_frame=False, return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=42, test_size=3_000)
skl_pca = PCA(n_components=32, random_state=42)
skl_pca.fit(X_train)
Now, let’s migrate the trained model to Faiss:
import faiss
import numpy as np
def sklearn_pca_to_faiss(skl_pca) -> faiss.PCAMatrix:
    d_in = skl_pca.components_.shape[1]
    d_out = skl_pca.n_components_
    # Build A: rows are components; fold the whitening scale in if requested
    if getattr(skl_pca, "whiten", False):
        scale = np.sqrt(skl_pca.explained_variance_)[:, None]
        A = (skl_pca.components_ / scale).astype(np.float32)
    else:
        A = skl_pca.components_.astype(np.float32)
    faiss_pca = faiss.PCAMatrix(d_in, d_out, 0.0, False)  # eigen_power handled manually
    faiss.copy_array_to_vector(A.reshape(-1), faiss_pca.A)
    mean = skl_pca.mean_.astype(np.float32)
    faiss.copy_array_to_vector(mean.reshape(-1), faiss_pca.mean)
    # Choose bias so that X @ A.T + b == (X - mean) @ A.T
    b = -mean @ A.T  # shape (d_out,)
    faiss.copy_array_to_vector(b.reshape(-1), faiss_pca.b)
    faiss_pca.is_trained = True
    return faiss_pca
faiss_pca = sklearn_pca_to_faiss(skl_pca)
Important: use Faiss’s copy_array_to_vector utility to load arrays into Faiss structures. See this file for implementation details.
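As a sanity check on the copy itself, you can read the data back with the companion helper faiss.vector_to_array; a minimal sketch (here skl_pca was fit with whiten=False, so A is just the components matrix):
# round-trip: the flattened matrix should come back bit-for-bit
A_back = faiss.vector_to_array(faiss_pca.A).reshape(skl_pca.n_components_, -1)
np.testing.assert_array_equal(A_back, skl_pca.components_.astype(np.float32))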
Validation
Always validate that the migration preserves results:
import time
import numpy as np
X = np.random.randn(1_000_000, faiss_pca.d_in).astype(np.float32)
# Check over some random vectors
np.testing.assert_allclose(skl_pca.transform(X), faiss_pca.apply_py(X), atol=1e-5)
# Check over train vectors
np.testing.assert_allclose(skl_pca.transform(X_train), faiss_pca.apply_py(X_train), atol=1e-5)
# Check over test vectors
np.testing.assert_allclose(skl_pca.transform(X_test), faiss_pca.apply_py(X_test), atol=1e-5)
print("OK: sklearn == faiss")
# sklearn
t0 = time.perf_counter(); _ = skl_pca.transform(X); t1 = time.perf_counter()
# faiss
t2 = time.perf_counter(); _ = faiss_pca.apply_py(X); t3 = time.perf_counter()
print(f"sklearn.transform: {(t1-t0):.3f}s | {(X.shape[0]/(t1-t0)):.0f} vec/s")
print(f"faiss.apply_py : {(t3-t2):.3f}s | {(X.shape[0]/(t3-t2)):.0f} vec/s")
print(f"Speedup: {((t1-t0)/(t3-t2)):.1f}x")
In this example, a 1.2x speedup was achieved. See the complete code here.
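A nice bonus for deployment: the migrated transform can be persisted and reloaded with Faiss’s standard I/O helpers, write_VectorTransform and read_VectorTransform; a minimal sketch (the file name is illustrative):
# persist the migrated PCA and reload it elsewhere
faiss.write_VectorTransform(faiss_pca, "pca32.pcamatrix")
faiss_pca_loaded = faiss.read_VectorTransform("pca32.pcamatrix")
np.testing.assert_allclose(faiss_pca_loaded.apply_py(X_test.astype(np.float32)),
                           faiss_pca.apply_py(X_test.astype(np.float32)))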
Conclusion
Migrating from scikit-learn to Faiss for PCA application is a straightforward optimization with real-world impact. You can keep sklearn for training and validation, then deploy the exact same projection using Faiss, boosting inference performance without retraining.
This method is simple, deterministic, and production-ready. And with just a few lines of code, you bridge the gap between experimentation and scalable deployment.