.. DO NOT EDIT.
.. THIS FILE WAS AUTOMATICALLY GENERATED BY SPHINX-GALLERY.
.. TO MAKE CHANGES, EDIT THE SOURCE PYTHON FILE:
.. "auto_examples/plot_text_classification.py"
.. LINE NUMBERS ARE GIVEN BELOW.

.. only:: html

    .. note::
        :class: sphx-glr-download-link-note

        :ref:`Go to the end `
        to download the full example code.

.. rst-class:: sphx-glr-example-title

.. _sphx_glr_auto_examples_plot_text_classification.py:


Text Classification with scikit-learn
-------------------------------------

This example shows how you can create a Hugging Face Hub compatible repository
for a text classification task using scikit-learn. We also show how you can
generate a model card for the model and the task at hand.

.. GENERATED FROM PYTHON SOURCE LINES 11-14

Imports
=======
First we will import everything required for the rest of this document.

.. GENERATED FROM PYTHON SOURCE LINES 14-36

.. code-block:: Python


    import pickle
    from pathlib import Path
    from tempfile import mkdtemp, mkstemp

    import pandas as pd
    import sklearn
    from sklearn.datasets import fetch_20newsgroups
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.metrics import (
        ConfusionMatrixDisplay,
        accuracy_score,
        classification_report,
        confusion_matrix,
        f1_score,
    )
    from sklearn.model_selection import train_test_split
    from sklearn.naive_bayes import MultinomialNB
    from sklearn.pipeline import Pipeline

    from skops import card, hub_utils


.. GENERATED FROM PYTHON SOURCE LINES 37-41

Data
====
We will use the 20 newsgroups dataset from scikit-learn. The dataset has
curated news on 20 topics and comes with a training and a test split. Below,
we additionally hold out 30% of the training subset as our own test set.

.. GENERATED FROM PYTHON SOURCE LINES 41-50

.. code-block:: Python


    twenty_train = fetch_20newsgroups(subset="train", shuffle=True, random_state=42)
    twenty_validation = fetch_20newsgroups(subset="test", shuffle=True, random_state=42)

    # Split the training subset into a training part and a held-out test part.
    X_train, X_test, y_train, y_test = train_test_split(
        twenty_train.data, twenty_train.target, test_size=0.3, random_state=42
    )


.. GENERATED FROM PYTHON SOURCE LINES 51-56

Train a Model
=============
To train a model, we first need to convert our text data to vectors. We use
``CountVectorizer`` in our pipeline for this, and fit a Multinomial Naive Bayes
model on the output of the vectorization.

.. GENERATED FROM PYTHON SOURCE LINES 56-66

.. code-block:: Python


    model = Pipeline(
        [
            ("count", CountVectorizer()),
            ("clf", MultinomialNB()),
        ]
    )

    model.fit(X_train, y_train)


.. raw:: html

    Pipeline(steps=[('count', CountVectorizer()), ('clf', MultinomialNB())])

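
To make the two pipeline steps more concrete, the following is a minimal
sketch on a made-up toy corpus. The documents, the labels and the names
``toy_docs``, ``toy_labels`` and ``toy_model`` are illustrative assumptions
and not part of this example's data. It shows that ``CountVectorizer`` turns
each document into a row of token counts, and that the same kind of pipeline
as above can then classify an unseen sentence.

.. code-block:: Python

    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.naive_bayes import MultinomialNB
    from sklearn.pipeline import Pipeline

    # Hypothetical toy corpus, unrelated to the 20 newsgroups data.
    toy_docs = [
        "the gpu renders images",
        "the team won the game",
        "gpu memory and images",
        "the game went to overtime",
    ]
    toy_labels = [0, 1, 0, 1]  # made-up labels: 0 = graphics, 1 = sports

    # Step 1: bag-of-words counts, one row per document, one column per token.
    vectorizer = CountVectorizer()
    counts = vectorizer.fit_transform(toy_docs)
    print(counts.shape)  # (4, 12): 4 documents, 12 distinct tokens

    # "the" appears twice in the second document.
    print(counts[1, vectorizer.vocabulary_["the"]])  # 2

    # Step 2: the same two-step pipeline as above, fitted on the toy data.
    toy_model = Pipeline(
        [
            ("count", CountVectorizer()),
            ("clf", MultinomialNB()),
        ]
    )
    toy_model.fit(toy_docs, toy_labels)

    # Tokens unseen during fitting ("on") are ignored at prediction time.
    print(toy_model.predict(["images on the gpu"]))  # [0]

On this tiny corpus the classifier favours class 0 because the words of the
query were used more often in the class 0 training documents. The real model
above works the same way, only with a much larger vocabulary.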

.. GENERATED FROM PYTHON SOURCE LINES 67-70

Inference
=========
Let's see if the model works.

.. GENERATED FROM PYTHON SOURCE LINES 70-79

.. code-block:: Python


    docs_new = [
        "A graphics processing unit is a specialized electronic circuit designed to"
        " manipulate and alter memory to accelerate the creation of images in a frame"
        " buffer intended for output to a display device."
    ]
    predicted = model.predict(docs_new)
    # NOTE: this indexes the training targets by the predicted class id. Use
    # ``twenty_train.target_names[predicted[0]]`` for the newsgroup name instead.
    print(twenty_train.target[predicted[0]])


.. rst-class:: sphx-glr-script-out

.. code-block:: none

    4


.. GENERATED FROM PYTHON SOURCE LINES 80-83

Initialize a repository to save our files in
============================================
We will now initialize a repository and save our model.

.. GENERATED FROM PYTHON SOURCE LINES 83-98

.. code-block:: Python


    # Persist the fitted pipeline to a temporary pickle file.
    _, pkl_name = mkstemp(prefix="skops-", suffix=".pkl")

    with open(pkl_name, mode="bw") as f:
        pickle.dump(model, file=f)

    # Create a local repository directory containing the model and its config.
    local_repo = mkdtemp(prefix="skops-")

    hub_utils.init(
        model=pkl_name,
        requirements=[f"scikit-learn={sklearn.__version__}"],
        dst=local_repo,
        task="text-classification",
        data=X_test,
    )


.. GENERATED FROM PYTHON SOURCE LINES 99-105

Create a model card
===================
We now create a model card and populate its metadata with the information
already provided in ``config.json``, which itself is created by the call to
:func:`.hub_utils.init` above. We will see below how we can populate the model
card with useful information.

.. GENERATED FROM PYTHON SOURCE LINES 105-108

.. code-block:: Python


    model_card = card.Card(model, metadata=card.metadata_from_config(Path(local_repo)))


.. GENERATED FROM PYTHON SOURCE LINES 109-114

Add more information
====================
So far, the model card does not tell viewers a lot about the model. Therefore,
we add more information about the model, such as a description and its license.

.. GENERATED FROM PYTHON SOURCE LINES 114-134

.. code-block:: Python


    model_card.metadata.license = "mit"
    limitations = "This model is not ready to be used in production."
    model_description = (
        "This is a Multinomial Naive Bayes model trained on the 20 newsgroups dataset."
        " Count vectorizer is used for vectorization."
    )
    model_card_authors = "skops_user"
    get_started_code = (
        "import pickle\nwith open(pkl_filename, 'rb') as file:\n    clf = pickle.load(file)"
    )
    citation_bibtex = "bibtex\n@inproceedings{...,year={2020}}"
    model_card.add(
        citation_bibtex=citation_bibtex,
        get_started_code=get_started_code,
        model_card_authors=model_card_authors,
        limitations=limitations,
        model_description=model_description,
    )


.. rst-class:: sphx-glr-script-out

.. code-block:: none

    Card(
      model=Pipeline(steps=[('count', CountVectorizer()), ('clf', MultinomialNB())]),
      metadata.library_name=sklearn,
      metadata.license=mit,
      metadata.tags=['sklearn', 'skops', 'text-classification'],
      metadata.model_format=pickle,
      metadata.model_file=skops-_45qv_y1.pkl,
      Model description/Training Procedure/Hyperparameters=TableSection(27x2),
      Model description/Training Procedure/Model Plot=