
.. DO NOT EDIT.
.. THIS FILE WAS AUTOMATICALLY GENERATED BY SPHINX-GALLERY.
.. TO MAKE CHANGES, EDIT THE SOURCE PYTHON FILE:
.. "auto_examples/plot_text_classification.py"
.. LINE NUMBERS ARE GIVEN BELOW.

.. only:: html

    .. note::
        :class: sphx-glr-download-link-note

        :ref:`Go to the end `
        to download the full example code.

.. rst-class:: sphx-glr-example-title

.. _sphx_glr_auto_examples_plot_text_classification.py:


Text Classification with scikit-learn
-------------------------------------
This example shows how you can create a Hugging Face Hub compatible repo for a
text classification task using scikit-learn. We also show how you can generate
a model card for the model and the task at hand.

.. GENERATED FROM PYTHON SOURCE LINES 11-14

Imports
=======
First we will import everything required for the rest of this document.

.. GENERATED FROM PYTHON SOURCE LINES 14-35

.. code-block:: Python


    import pickle
    from pathlib import Path
    from tempfile import mkdtemp, mkstemp

    import pandas as pd
    from sklearn.datasets import fetch_20newsgroups
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.metrics import (
        ConfusionMatrixDisplay,
        accuracy_score,
        classification_report,
        confusion_matrix,
        f1_score,
    )
    from sklearn.model_selection import train_test_split
    from sklearn.naive_bayes import MultinomialNB
    from sklearn.pipeline import Pipeline

    from skops import card


.. GENERATED FROM PYTHON SOURCE LINES 36-40

Data
====
We will use the 20 newsgroups dataset from sklearn. The dataset has curated
news on 20 topics. It has a training and a test split.

.. GENERATED FROM PYTHON SOURCE LINES 40-49

.. code-block:: Python


    twenty_train = fetch_20newsgroups(subset="train", shuffle=True, random_state=42)
    twenty_validation = fetch_20newsgroups(subset="test", shuffle=True, random_state=42)

    X_train, X_test, y_train, y_test = train_test_split(
        twenty_train.data, twenty_train.target, test_size=0.3, random_state=42
    )


.. GENERATED FROM PYTHON SOURCE LINES 50-55

Train a Model
=============
To train a model, we first need to convert the text data to vectors. We use
CountVectorizer in our pipeline and fit a Multinomial Naive Bayes model on the
output of the vectorization.

.. GENERATED FROM PYTHON SOURCE LINES 55-65

.. code-block:: Python


    model = Pipeline(
        [
            ("count", CountVectorizer()),
            ("clf", MultinomialNB()),
        ]
    )

    model.fit(X_train, y_train)


.. raw:: html

Pipeline(steps=[('count', CountVectorizer()), ('clf', MultinomialNB())])


Pipeline (fitted) parameters:

- steps: ``[('count', ...), ('clf', ...)]``
- transform_input: ``None``
- memory: ``None``
- verbose: ``False``

CountVectorizer parameters:

- input: ``'content'``
- encoding: ``'utf-8'``
- decode_error: ``'strict'``
- strip_accents: ``None``
- lowercase: ``True``
- preprocessor: ``None``
- tokenizer: ``None``
- stop_words: ``None``
- token_pattern: ``'(?u)\\b\\w\\w+\\b'``
- ngram_range: ``(1, ...)``
- analyzer: ``'word'``
- max_df: ``1.0``
- min_df: ``1``
- max_features: ``None``
- vocabulary: ``None``
- binary: ``False``
- dtype: ``<class 'numpy.int64'>``

MultinomialNB parameters:

- alpha: ``1.0``
- force_alpha: ``True``
- fit_prior: ``True``
- class_prior: ``None``

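To make the vectorization step more concrete, the sketch below mimics what a
count vectorizer does with the default settings listed above
(``lowercase=True`` and the ``token_pattern`` regular expression). It is a
simplified pure-Python illustration, not scikit-learn's implementation: the
helper names ``tokenize``, ``fit_vocabulary`` and ``transform`` and the toy
documents are hypothetical, and it returns dense lists where ``CountVectorizer``
returns a sparse matrix.

```python
import re
from collections import Counter

# Same regular expression as the default token_pattern shown above:
# a token is a run of two or more word characters.
TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")


def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return TOKEN_PATTERN.findall(text.lower())


def fit_vocabulary(documents):
    """Map every distinct token to a column index, in sorted order."""
    tokens = sorted({tok for doc in documents for tok in tokenize(doc)})
    return {tok: idx for idx, tok in enumerate(tokens)}


def transform(documents, vocabulary):
    """Turn each document into a dense list of per-token counts."""
    rows = []
    for doc in documents:
        counts = Counter(tokenize(doc))
        rows.append([counts.get(tok, 0) for tok in vocabulary])
    return rows


# Hypothetical toy corpus, only for illustration.
toy_docs = ["The GPU renders images", "A GPU is a GPU"]
vocab = fit_vocabulary(toy_docs)
matrix = transform(toy_docs, vocab)

print(vocab)   # token -> column index
print(matrix)  # one row of counts per document
```

Single-character tokens such as "a" are dropped by the pattern, which is why
they never reach the vocabulary. The Multinomial Naive Bayes step then models
each class from count rows like these.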
.. GENERATED FROM PYTHON SOURCE LINES 66-69

Inference
=========
Let's see if the model works.

.. GENERATED FROM PYTHON SOURCE LINES 69-78

.. code-block:: Python


    docs_new = [
        "A graphics processing unit is a specialized electronic circuit designed to"
        " manipulate and alter memory to accelerate the creation of images in a frame"
        " buffer intended for output to a display device."
    ]
    predicted = model.predict(docs_new)
    # NOTE: this indexes the array of training labels with the predicted class
    # id; twenty_train.target_names[predicted[0]] would give the class name.
    print(twenty_train.target[predicted[0]])



.. rst-class:: sphx-glr-script-out

 .. code-block:: none

    4


.. GENERATED FROM PYTHON SOURCE LINES 79-82

Initialize a repository to save our files in
============================================
We will now initialize a repository and save our model.

.. GENERATED FROM PYTHON SOURCE LINES 82-89

.. code-block:: Python


    _, pkl_name = mkstemp(prefix="skops-", suffix=".pkl")

    with open(pkl_name, mode="bw") as f:
        pickle.dump(model, file=f)

    local_repo = mkdtemp(prefix="skops-")


.. GENERATED FROM PYTHON SOURCE LINES 90-94

Create a model card
===================
We now create a model card. We will see below how we can populate the model
card with useful information.

.. GENERATED FROM PYTHON SOURCE LINES 94-97

.. code-block:: Python


    model_card = card.Card(model)


.. GENERATED FROM PYTHON SOURCE LINES 98-103

Add more information
====================
So far, the model card does not tell viewers a lot about the model. Therefore,
we add more information about the model, such as a description, its
limitations, its authors, and a citation.

.. GENERATED FROM PYTHON SOURCE LINES 103-122

.. code-block:: Python


    limitations = "This model is not ready to be used in production."
    # The trailing space keeps the two sentences from running together when
    # the adjacent string literals are concatenated.
    model_description = (
        "This is a Multinomial Naive Bayes model trained on 20 news groups dataset. "
        "Count vectorizer is used for vectorization."
    )
    model_card_authors = "skops_user"
    get_started_code = (
        "import pickle\nwith open(pkl_filename, 'rb') as file:\n    clf = pickle.load(file)"
    )
    citation_bibtex = "bibtex\n@inproceedings{...,year={2020}}"
    model_card.add(
        citation_bibtex=citation_bibtex,
        get_started_code=get_started_code,
        model_card_authors=model_card_authors,
        limitations=limitations,
        model_description=model_description,
    )



.. rst-class:: sphx-glr-script-out

 .. code-block:: none


    Card(
      model=Pipeline(steps=[('count', CountVectorizer()), ('clf', MultinomialNB())]),
      Model description/Training Procedure/Hyperparameters=TableSection(27x2),
      Model description/Training Procedure/Model Plot=