scikit-learn models on Hugging Face Hub

This guide demonstrates how you can use this package to create a Hugging Face Hub model repository based on a scikit-learn compatible model, and how to fetch scikit-learn compatible models from the Hub and run them locally.

Imports

First we will import everything required for the rest of this document.

import json
import os
import pickle
from pathlib import Path
from tempfile import mkdtemp, mkstemp
from uuid import uuid4

import sklearn
from huggingface_hub import HfApi
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import HistGradientBoostingClassifier
from sklearn.experimental import enable_halving_search_cv  # noqa
from sklearn.model_selection import HalvingGridSearchCV, train_test_split

from skops import card, hub_utils

Data

Then we load the breast cancer dataset to train and evaluate our model.

X, y = load_breast_cancer(as_frame=True, return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)
print("X's summary: ", X.describe())
print("y's summary: ", y.describe())
X's summary:         mean radius  mean texture  ...  worst symmetry  worst fractal dimension
count   569.000000    569.000000  ...      569.000000               569.000000
mean     14.127292     19.289649  ...        0.290076                 0.083946
std       3.524049      4.301036  ...        0.061867                 0.018061
min       6.981000      9.710000  ...        0.156500                 0.055040
25%      11.700000     16.170000  ...        0.250400                 0.071460
50%      13.370000     18.840000  ...        0.282200                 0.080040
75%      15.780000     21.800000  ...        0.317900                 0.092080
max      28.110000     39.280000  ...        0.663800                 0.207500

[8 rows x 30 columns]
y's summary:  count    569.000000
mean       0.627417
std        0.483918
min        0.000000
25%        0.000000
50%        1.000000
75%        1.000000
max        1.000000
Name: target, dtype: float64

Train a Model

Using the above data, we train a model. To tune the hyperparameters, we use HalvingGridSearchCV with a parameter grid over HistGradientBoostingClassifier.

param_grid = {
    "max_leaf_nodes": [5, 10, 15],
    "max_depth": [2, 5, 10],
}

model = HalvingGridSearchCV(
    estimator=HistGradientBoostingClassifier(),
    param_grid=param_grid,
    random_state=42,
    n_jobs=-1,
).fit(X_train, y_train)
model.score(X_test, y_test)
0.9590643274853801
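
Before publishing the model, it can be useful to check which hyperparameters the halving search selected. This is a quick optional inspection; the printed values depend on your environment and are not part of the original output:

# Best hyperparameter combination found by the search and its mean
# cross-validation score (values will vary between runs and environments).
print(model.best_params_)
print(model.best_score_)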

Initialize a Model Repo

We now initialize a model repository locally, which we will then push to the hub. For that, we first need to store the model as a pickle file and pass it to the hub tools.

# The file name is not significant, here we choose to save it with a `pkl`
# extension.
_, pkl_name = mkstemp(prefix="skops-", suffix=".pkl")
with open(pkl_name, mode="wb") as f:
    pickle.dump(model, file=f)

local_repo = mkdtemp(prefix="skops-")
hub_utils.init(
    model=pkl_name,
    requirements=[f"scikit-learn={sklearn.__version__}"],
    dst=local_repo,
    task="tabular-classification",
    data=X_test,
)
if "__file__" in locals():  # __file__ not defined during docs built
    # Add this script itself to the files to be uploaded for reproducibility
    hub_utils.add_files(__file__, dst=local_repo)

We can now see what the contents of the created local repo are:

print(os.listdir(local_repo))
['skops-n6q2luss.pkl', 'config.json']

Model Card

We will now create a model card and save it. For more information about how to create a good model card, refer to the model card example. The following code uses metadata_from_config() which creates a minimal metadata object to be included in the metadata section of the model card. The configuration used by this method is stored in the config.json file which is created by the call to init().

model_card = card.Card(model, metadata=card.metadata_from_config(Path(local_repo)))
model_card.save(Path(local_repo) / "README.md")
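
Beyond the minimal metadata, you could also enrich the card before (or after) saving it. The following is a small illustrative sketch, not part of the original example: the section names are arbitrary, and add_metrics is assumed to be available in your installed skops version; re-saving overwrites the README.md written above.

# Illustrative only: add free-text sections and a metric, then re-save the
# card. Section names here are arbitrary; `add_metrics` is assumed to exist
# in the installed skops version.
model_card.add(
    **{
        "Model description": "HistGradientBoostingClassifier tuned with HalvingGridSearchCV.",
        "Model description/Intended uses & limitations": "For demonstration purposes only.",
    }
)
model_card.add_metrics(**{"accuracy on test set": model.score(X_test, y_test)})
model_card.save(Path(local_repo) / "README.md")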

Push to Hub

And finally, we can push the model to the hub. This requires a user access token, which you can get from https://huggingface.co/settings/tokens.

# you can put your own token here, or set it as an environment variable before
# running this script.
token = os.environ["HF_HUB_TOKEN"]

repo_name = f"hf_hub_example-{uuid4()}"
user_name = HfApi().whoami(token=token)["name"]
repo_id = f"{user_name}/{repo_name}"
print(f"Creating and pushing to repo: {repo_id}")
Creating and pushing to repo: skops-ci/hf_hub_example-f6aaeea8-80b1-485c-9125-26e0b9252cf5

Now we can push our files to the repo. The following function creates the remote repository if it doesn't exist; this is controlled via the create_remote argument. Note that here we're setting private=True, which means only people with the right permissions can see the model. Set private=False to make it visible to the public.

hub_utils.push(
    repo_id=repo_id,
    source=local_repo,
    token=token,
    commit_message="pushing files to the repo from the example!",
    create_remote=True,
    private=True,
)
Upload 1 LFS files:   0%|          | 0/1 [00:00<?, ?it/s]

skops-n6q2luss.pkl:   0%|          | 0.00/233k [00:00<?, ?B/s]
skops-n6q2luss.pkl: 100%|##########| 233k/233k [00:00<00:00, 581kB/s]

Upload 1 LFS files: 100%|##########| 1/1 [00:00<00:00,  2.48it/s]

Once uploaded, other users can download and use the model, unless the repo is private (as it is here), in which case only you and collaborators can access it. Given a repository's name, here's how one can download it:

repo_copy = mkdtemp(prefix="skops")
hub_utils.download(repo_id=repo_id, dst=repo_copy, token=token)
print(os.listdir(repo_copy))
Fetching 4 files:   0%|          | 0/4 [00:00<?, ?it/s]

Downloading (…)f6fbc/.gitattributes:   0%|          | 0.00/1.48k [00:00<?, ?B/s]
Downloading (…)f6fbc/.gitattributes: 100%|##########| 1.48k/1.48k [00:00<00:00, 1.66MB/s]

Fetching 4 files:  25%|##5       | 1/4 [00:00<00:00,  5.73it/s]

Downloading (…)ab87df6fbc/README.md:   0%|          | 0.00/12.2k [00:00<?, ?B/s]
Downloading (…)ab87df6fbc/README.md: 100%|##########| 12.2k/12.2k [00:00<00:00, 12.9MB/s]


Downloading (…)87df6fbc/config.json:   0%|          | 0.00/4.85k [00:00<?, ?B/s]
Downloading (…)87df6fbc/config.json: 100%|##########| 4.85k/4.85k [00:00<00:00, 4.37MB/s]


Downloading skops-n6q2luss.pkl:   0%|          | 0.00/233k [00:00<?, ?B/s]
Downloading skops-n6q2luss.pkl: 100%|##########| 233k/233k [00:00<00:00, 6.59MB/s]

Fetching 4 files: 100%|##########| 4/4 [00:00<00:00, 14.89it/s]
['skops-n6q2luss.pkl', 'README.md', 'config.json', '.gitattributes']

You can also get the requirements of this repository:

print(hub_utils.get_requirements(path=repo_copy))
['scikit-learn=1.2.2']

As well as the complete configuration of the project:

print(json.dumps(hub_utils.get_config(path=repo_copy), indent=2))
{
  "sklearn": {
    "columns": [
      "mean radius",
      "mean texture",
      "mean perimeter",
      "mean area",
      "mean smoothness",
      "mean compactness",
      "mean concavity",
      "mean concave points",
      "mean symmetry",
      "mean fractal dimension",
      "radius error",
      "texture error",
      "perimeter error",
      "area error",
      "smoothness error",
      "compactness error",
      "concavity error",
      "concave points error",
      "symmetry error",
      "fractal dimension error",
      "worst radius",
      "worst texture",
      "worst perimeter",
      "worst area",
      "worst smoothness",
      "worst compactness",
      "worst concavity",
      "worst concave points",
      "worst symmetry",
      "worst fractal dimension"
    ],
    "environment": [
      "scikit-learn=1.2.2"
    ],
    "example_input": {
      "area error": [
        30.29,
        96.05,
        48.31
      ],
      "compactness error": [
        0.01911,
        0.01652,
        0.01484
      ],
      "concave points error": [
        0.01037,
        0.0137,
        0.01093
      ],
      "concavity error": [
        0.02701,
        0.02269,
        0.02813
      ],
      "fractal dimension error": [
        0.003586,
        0.001698,
        0.002461
      ],
      "mean area": [
        481.9,
        1130.0,
        748.9
      ],
      "mean compactness": [
        0.1058,
        0.1029,
        0.1223
      ],
      "mean concave points": [
        0.03821,
        0.07951,
        0.08087
      ],
      "mean concavity": [
        0.08005,
        0.108,
        0.1466
      ],
      "mean fractal dimension": [
        0.06373,
        0.05461,
        0.05796
      ],
      "mean perimeter": [
        81.09,
        123.6,
        101.7
      ],
      "mean radius": [
        12.47,
        18.94,
        15.46
      ],
      "mean smoothness": [
        0.09965,
        0.09009,
        0.1092
      ],
      "mean symmetry": [
        0.1925,
        0.1582,
        0.1931
      ],
      "mean texture": [
        18.6,
        21.31,
        19.48
      ],
      "perimeter error": [
        2.497,
        5.486,
        3.094
      ],
      "radius error": [
        0.3961,
        0.7888,
        0.4743
      ],
      "smoothness error": [
        0.006953,
        0.004444,
        0.00624
      ],
      "symmetry error": [
        0.01782,
        0.01386,
        0.01397
      ],
      "texture error": [
        1.044,
        0.7975,
        0.7859
      ],
      "worst area": [
        677.9,
        1866.0,
        1156.0
      ],
      "worst compactness": [
        0.2378,
        0.2336,
        0.2394
      ],
      "worst concave points": [
        0.1015,
        0.1789,
        0.1514
      ],
      "worst concavity": [
        0.2671,
        0.2687,
        0.3791
      ],
      "worst fractal dimension": [
        0.0875,
        0.06589,
        0.08019
      ],
      "worst perimeter": [
        96.05,
        165.9,
        124.9
      ],
      "worst radius": [
        14.97,
        24.86,
        19.26
      ],
      "worst smoothness": [
        0.1426,
        0.1193,
        0.1546
      ],
      "worst symmetry": [
        0.3014,
        0.2551,
        0.2837
      ],
      "worst texture": [
        24.64,
        26.58,
        26.0
      ]
    },
    "model": {
      "file": "skops-n6q2luss.pkl"
    },
    "model_format": "pickle",
    "task": "tabular-classification",
    "use_intelex": false
  }
}

Now you can check the contents of the repository under your user account on the Hub.
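
To actually run the downloaded model locally, as promised in the introduction, you can look up the pickle file name in the configuration and load it. This is a short sketch and assumes a compatible scikit-learn version is installed locally:

# Locate the pickled model inside the downloaded repo via the config,
# un-pickle it, and run a prediction on a few rows of the test set.
config = hub_utils.get_config(path=repo_copy)
model_file = Path(repo_copy) / config["sklearn"]["model"]["file"]
with open(model_file, mode="rb") as f:
    downloaded_model = pickle.load(f)
print(downloaded_model.predict(X_test.head()))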

Update Requirements

If you update your environment and the versions of your requirements change, you can update the requirements in your repo by calling update_env, which detects the versions installed in the current environment and updates the requirements accordingly.

hub_utils.update_env(path=local_repo, requirements=["scikit-learn"])
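
You can verify the change by reading the requirements back from the local repo; the exact pin depends on the scikit-learn version installed in your environment:

# Confirm that the repo's requirements now reflect the current environment.
print(hub_utils.get_requirements(path=local_repo))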

Delete Repository

At the end, you can also delete the repository you created using HfApi().delete_repo. For more information, please refer to the documentation of the huggingface_hub library.

HfApi().delete_repo(repo_id=repo_id, token=token)

Total running time of the script: ( 0 minutes 8.106 seconds)
