.. DO NOT EDIT.
.. THIS FILE WAS AUTOMATICALLY GENERATED BY SPHINX-GALLERY.
.. TO MAKE CHANGES, EDIT THE SOURCE PYTHON FILE:
.. "auto_examples/plot_text_classification.py"
.. LINE NUMBERS ARE GIVEN BELOW.

.. only:: html

    .. note::
        :class: sphx-glr-download-link-note

        :ref:`Go to the end `
        to download the full example code.

.. rst-class:: sphx-glr-example-title

.. _sphx_glr_auto_examples_plot_text_classification.py:


Text Classification with scikit-learn
-------------------------------------

This example shows how to create a Hugging Face Hub compatible repo for a
text classification task using scikit-learn. We also show how to generate a
model card for the model and the task at hand.

.. GENERATED FROM PYTHON SOURCE LINES 11-14

Imports
=======
First we import everything required for the rest of this document.

.. GENERATED FROM PYTHON SOURCE LINES 14-35

.. code-block:: Python


    import pickle
    from pathlib import Path
    from tempfile import mkdtemp, mkstemp

    import pandas as pd
    from sklearn.datasets import fetch_20newsgroups
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.metrics import (
        ConfusionMatrixDisplay,
        accuracy_score,
        classification_report,
        confusion_matrix,
        f1_score,
    )
    from sklearn.model_selection import train_test_split
    from sklearn.naive_bayes import MultinomialNB
    from sklearn.pipeline import Pipeline

    from skops import card


.. GENERATED FROM PYTHON SOURCE LINES 36-40

Data
====
We will use the 20 newsgroups dataset from scikit-learn. The dataset contains
curated news posts on 20 topics and ships with a training and a test split.
We keep the official test split for validation and split the training subset
70/30 into our own train and test sets.

.. GENERATED FROM PYTHON SOURCE LINES 40-49

.. code-block:: Python


    twenty_train = fetch_20newsgroups(subset="train", shuffle=True, random_state=42)

    twenty_validation = fetch_20newsgroups(subset="test", shuffle=True, random_state=42)

    X_train, X_test, y_train, y_test = train_test_split(
        twenty_train.data, twenty_train.target, test_size=0.3, random_state=42
    )


..
.. GENERATED FROM PYTHON SOURCE LINES 50-55

Train a Model
=============
To train a model, we first need to convert the text to vectors. We use
``CountVectorizer`` as the first step of our pipeline and fit a Multinomial
Naive Bayes model on the output of the vectorization.

.. GENERATED FROM PYTHON SOURCE LINES 55-65

.. code-block:: Python


    model = Pipeline(
        [
            ("count", CountVectorizer()),
            ("clf", MultinomialNB()),
        ]
    )

    model.fit(X_train, y_train)


.. rst-class:: sphx-glr-script-out

 .. code-block:: none

    Pipeline(steps=[('count', CountVectorizer()), ('clf', MultinomialNB())])


.. GENERATED FROM PYTHON SOURCE LINES 66-69

Inference
=========
Let's check that the model works on a new document.

.. GENERATED FROM PYTHON SOURCE LINES 69-78

.. code-block:: Python


    docs_new = [
        "A graphics processing unit is a specialized electronic circuit designed to"
        " manipulate and alter memory to accelerate the creation of images in a frame"
        " buffer intended for output to a display device."
    ]
    predicted = model.predict(docs_new)
    # Map the predicted class index to its newsgroup name.
    print(twenty_train.target_names[predicted[0]])


.. GENERATED FROM PYTHON SOURCE LINES 79-82

Initialize a repository to save our files in
============================================
We now pickle the model to a temporary file and create a temporary directory
to serve as the local repository.

.. GENERATED FROM PYTHON SOURCE LINES 82-89

.. code-block:: Python


    _, pkl_name = mkstemp(prefix="skops-", suffix=".pkl")

    with open(pkl_name, mode="bw") as f:
        pickle.dump(model, file=f)

    local_repo = mkdtemp(prefix="skops-")


.. GENERATED FROM PYTHON SOURCE LINES 90-94

Create a model card
===================
We now create a model card. We will see below how we can populate the model
card with useful information.

.. GENERATED FROM PYTHON SOURCE LINES 94-97

.. code-block:: Python


    model_card = card.Card(model)


.. GENERATED FROM PYTHON SOURCE LINES 98-103

Add more information
====================
So far, the model card does not tell viewers a lot about the model. Therefore,
we add more information about the model, such as a description, its
limitations, its authors, a code snippet for getting started, and a citation.

.. GENERATED FROM PYTHON SOURCE LINES 103-122

.. code-block:: Python


    limitations = "This model is not ready to be used in production."
    model_description = (
        "This is a Multinomial Naive Bayes model trained on the 20 newsgroups dataset. "
        "Count vectorizer is used for vectorization."
    )
    model_card_authors = "skops_user"
    get_started_code = (
        "import pickle\nwith open(pkl_filename, 'rb') as file:\n    clf = pickle.load(file)"
    )
    citation_bibtex = "bibtex\n@inproceedings{...,year={2020}}"
    model_card.add(
        citation_bibtex=citation_bibtex,
        get_started_code=get_started_code,
        model_card_authors=model_card_authors,
        limitations=limitations,
        model_description=model_description,
    )


.. rst-class:: sphx-glr-script-out

 .. code-block:: none


    Card(
      model=Pipeline(steps=[('count', CountVectorizer()), ('clf', MultinomialNB())]),
      Model description/Training Procedure/Hyperparameters=TableSection(27x2),
      Model description/Training Procedure/Model Plot=