Text Vectorization
Overview
LLMs can be used purely for data preprocessing: they embed a chunk of text of arbitrary length into a fixed-dimensional vector that can then be fed to virtually any model (e.g. classification, regression, clustering).
With Scikit-Ollama you can choose from a large variety of embedding models, whose quality you can compare on leaderboards such as Hugging Face's MTEB. The following examples use the default embedding model, nomic-embed-text. Download it using the usual Ollama CLI command:
ollama pull nomic-embed-text
Example 1: Embedding the text
from skollama.models.ollama.vectorization import OllamaVectorizer
vectorizer = OllamaVectorizer(batch_size=2) # batch_size is number of parallel tasks
X = vectorizer.fit_transform(["This is a text", "This is another text"])
Example 2: Combining the vectorizer with the XGBoost classifier in a scikit-learn pipeline
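from sklearn.pipeline import Pipeline
from sklearn.preprocessing import LabelEncoder
from skollama.models.ollama.vectorization import OllamaVectorizer
from xgboost import XGBClassifier
# X_train, X_test (lists of texts) and y_train, y_test (labels) are assumed to be defined already.
# XGBClassifier expects numeric class labels, so encode the original labels first.
le = LabelEncoder()
y_train_encoded = le.fit_transform(y_train)
y_test_encoded = le.transform(y_test)
# Embed the texts with Ollama, then classify the embeddings with XGBoost.
steps = [("Ollama", OllamaVectorizer()), ("Clf", XGBClassifier())]
clf = Pipeline(steps)
clf.fit(X_train, y_train_encoded)
yh = clf.predict(X_test)  # encoded class labels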
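Because the pipeline in Example 2 is trained on encoded labels, its predictions come back as integers. The following is a minimal sketch of mapping them back to the original labels and scoring them. The toy labels are hypothetical, and `yh` is hard-coded as a stand-in for `clf.predict(X_test)` so that the snippet runs without an Ollama server:

```python
from sklearn.metrics import accuracy_score
from sklearn.preprocessing import LabelEncoder

# Hypothetical toy labels, used only for illustration.
y_train = ["positive", "negative", "neutral", "positive"]
y_test = ["negative", "positive", "neutral"]

# LabelEncoder sorts the classes: negative -> 0, neutral -> 1, positive -> 2.
le = LabelEncoder()
le.fit(y_train)
y_test_encoded = le.transform(y_test)

# Stand-in for `yh = clf.predict(X_test)` from the pipeline example.
yh = [0, 2, 2]

# Map the integer predictions back to the original string labels.
y_pred_labels = le.inverse_transform(yh)

# Score on the encoded labels; two of the three stand-in predictions match.
accuracy = accuracy_score(y_test_encoded, yh)
print(list(y_pred_labels), accuracy)
```

Keeping the fitted `LabelEncoder` alongside the pipeline is what makes the decoding step possible later, so it is worth persisting both together.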
API Reference
The following API reference only lists the parameters needed for the initialization of the estimator. The remaining methods follow the syntax of a scikit-learn transformer.
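As a rough illustration of what following the scikit-learn transformer syntax means in practice, the sketch below uses a hypothetical `ToyVectorizer` that exposes the same three methods (`fit`, `transform`, `fit_transform`). It is not how OllamaVectorizer is implemented: the hash-based values are placeholders rather than real embeddings, and the class name and `n_dims` parameter are assumptions made for this example only.

```python
import hashlib
from typing import List, Sequence


class ToyVectorizer:
    """Hypothetical stand-in that mimics the transformer method contract."""

    def __init__(self, n_dims: int = 8):
        # A SHA-256 digest has 32 bytes, so n_dims must not exceed 32 here.
        self.n_dims = n_dims

    def fit(self, X: Sequence[str], y=None) -> "ToyVectorizer":
        # Nothing is learned; fit returns self so that calls can be chained.
        return self

    def transform(self, X: Sequence[str]) -> List[List[float]]:
        # One fixed-length vector per input text, regardless of text length.
        return [self._embed(text) for text in X]

    def fit_transform(self, X: Sequence[str], y=None) -> List[List[float]]:
        # Equivalent to calling fit followed by transform on the same data.
        return self.fit(X, y).transform(X)

    def _embed(self, text: str) -> List[float]:
        # Placeholder "embedding": the first n_dims digest bytes scaled to [0, 1].
        digest = hashlib.sha256(text.encode("utf-8")).digest()
        return [byte / 255.0 for byte in digest[: self.n_dims]]


texts = ["This is a text", "This is another, considerably longer piece of text"]
vectorizer = ToyVectorizer(n_dims=8)
X = vectorizer.fit_transform(texts)
print(len(X), [len(vector) for vector in X])
```

Any object with this method contract can be placed as a step in a scikit-learn `Pipeline`, which is why the vectorizer combines with downstream estimators as in Example 2.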
OllamaVectorizer
from skollama.models.ollama.vectorization import OllamaVectorizer
Parameter | Type | Description |
---|---|---|
model | str | Model to use, by default "nomic-embed-text". |
batch_size | int | Number of samples per request, by default 1. |
key | Optional[str] | Estimator-specific API key; if None, retrieved from the global config, by default None. |
org | Optional[str] | Estimator-specific ORG key; if None, retrieved from the global config, by default None. |