scikit-learn
Assists with building, evaluating, and deploying machine learning models using scikit-learn. Use when performing data preprocessing, feature engineering, model selection, hyperparameter tuning, cross-validation, or building pipelines for classification, regression, and clustering tasks. Trigger words: sklearn, scikit-learn, machine learning, classification, regression, pipeline, cross-validation.
Usage
Getting Started
- Install the skill using the command above
- Open your AI coding agent (Claude Code, Codex, Gemini CLI, or Cursor)
- Reference the skill in your prompt
- The AI will use the skill's capabilities automatically
Example Prompts
- "Analyze the sales data in revenue.csv and identify trends"
- "Create a visualization comparing Q1 vs Q2 performance metrics"
Documentation
Overview
Scikit-learn is a Python machine learning library that provides a consistent API for the full ML workflow: data preprocessing (scaling, encoding, imputation), model selection (classification, regression, clustering), hyperparameter tuning (grid search, randomized search), cross-validation, and pipeline construction. It supports serialization via joblib for production deployment.
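The "consistent API" mentioned above can be shown in a few lines: every estimator exposes `fit()`, and predictors add `predict()`/`score()`. A minimal sketch using the bundled iris dataset:

```python
# Minimal sketch of scikit-learn's consistent estimator API.
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

clf = LogisticRegression(max_iter=1000)
clf.fit(X_train, y_train)             # learn from training data
accuracy = clf.score(X_test, y_test)  # evaluate on held-out data
print(f"held-out accuracy: {accuracy:.2f}")
```

Swapping `LogisticRegression` for any other classifier leaves the rest of the code unchanged, which is what makes pipelines and model selection composable.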
Instructions
- When preprocessing data, use `ColumnTransformer` to apply different transformers to numeric and categorical columns (`StandardScaler`, `OneHotEncoder`, `SimpleImputer`), always within a `Pipeline` to prevent data leakage.
- When choosing models, start with fast baselines (`LogisticRegression`, `RandomForestClassifier`) and use `HistGradientBoostingClassifier` for best tabular performance, since it handles missing values natively and is faster than `GradientBoostingClassifier`.
- When evaluating, use `cross_val_score` with 5-fold CV instead of a single train/test split, and use `classification_report()` instead of accuracy alone, since accuracy is misleading on imbalanced datasets.
- When tuning hyperparameters, use `RandomizedSearchCV` when the search space exceeds 100 combinations (faster than exhaustive `GridSearchCV`), and use `StratifiedKFold` or `TimeSeriesSplit` as appropriate.
- When building pipelines, chain preprocessing and model steps with `Pipeline` to ensure transformers fit only on training data, then serialize the full pipeline with `joblib.dump()` for deployment.
- When selecting features, use `permutation_importance()` for model-agnostic measurement, `SelectKBest` for statistical filtering, or `feature_importances_` from tree-based models.
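The preprocessing and evaluation advice above can be sketched as one leakage-safe pipeline. The column names (`age`, `income`, `plan`) and the synthetic data are assumptions for illustration only:

```python
# Hedged sketch: ColumnTransformer inside a Pipeline, scored with 5-fold CV.
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

rng = np.random.default_rng(0)
n = 200
X = pd.DataFrame({
    "age": rng.integers(18, 80, n).astype(float),     # hypothetical columns
    "income": rng.normal(50_000, 15_000, n),
    "plan": rng.choice(["basic", "pro", "enterprise"], n),
})
X.loc[::10, "income"] = np.nan     # simulate missing values
y = rng.integers(0, 2, n)          # synthetic binary target

preprocess = ColumnTransformer([
    ("num", Pipeline([
        ("impute", SimpleImputer(strategy="median")),
        ("scale", StandardScaler()),
    ]), ["age", "income"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["plan"]),
])

model = Pipeline([
    ("prep", preprocess),
    ("clf", LogisticRegression(max_iter=1000)),
])

# Transformers are re-fit on each training fold, so test folds stay unseen.
scores = cross_val_score(model, X, y, cv=5)
print(f"5-fold accuracy: {scores.mean():.2f} +/- {scores.std():.2f}")
```

Because the imputer and scaler live inside the pipeline, `cross_val_score` re-fits them per fold, which is exactly the leakage protection the instructions call for.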
Examples
Example 1: Build a customer churn prediction pipeline
User request: "Create a model to predict which customers will churn"
Actions:
- Build a `ColumnTransformer` with `StandardScaler` for numeric features and `OneHotEncoder` for categorical features
- Create a `Pipeline` with the transformer and `HistGradientBoostingClassifier`
- Tune hyperparameters with `RandomizedSearchCV` using `StratifiedKFold`
- Evaluate with `classification_report()`, focusing on recall for the churn class
Output: A tuned churn prediction pipeline with preprocessing, model, and evaluation metrics.
Example 2: Cluster customers into segments
User request: "Segment customers based on purchasing behavior"
Actions:
- Preprocess features with `StandardScaler` in a pipeline
- Use `KMeans` with silhouette score analysis to determine the optimal cluster count
- Run `PCA` for dimensionality reduction and visualization
- Profile clusters with `groupby` on original features to interpret segments
Output: Customer segments with labeled profiles and a visual cluster map.
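The segmentation steps above can be sketched on synthetic blob data (real purchasing features are an assumption): scale, choose k by silhouette score, then project with PCA for plotting.

```python
# Sketch: KMeans with silhouette-based model selection, plus a PCA projection.
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.decomposition import PCA
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

X, _ = make_blobs(n_samples=300, centers=4, n_features=5, random_state=0)
X_scaled = StandardScaler().fit_transform(X)

# Try several cluster counts and keep the best silhouette score.
best_k, best_score = None, -1.0
for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X_scaled)
    score = silhouette_score(X_scaled, labels)
    if score > best_score:
        best_k, best_score = k, score

labels = KMeans(n_clusters=best_k, n_init=10, random_state=0).fit_predict(X_scaled)
X_2d = PCA(n_components=2).fit_transform(X_scaled)   # 2-D coords for a scatter plot
print(f"best k = {best_k}, silhouette = {best_score:.2f}")
```

The final profiling step would join `labels` back onto the original DataFrame and run `groupby(labels).mean()` to describe each segment.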
Guidelines
- Always use `Pipeline` to prevent data leakage by fitting transformers only on training data.
- Use `ColumnTransformer` for mixed data types: numeric scaling and categorical encoding in one object.
- Use `HistGradientBoostingClassifier` over `GradientBoostingClassifier`, since it is faster and handles missing values natively.
- Use `cross_val_score` with 5-fold CV rather than a single train/test split, since single splits are noisy.
- Use `RandomizedSearchCV` when the search space exceeds 100 combinations.
- Use `classification_report()`, not just accuracy, which is misleading on imbalanced datasets.
- Serialize the full pipeline with `joblib`, not just the model, since deployment needs preprocessing too.
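The last guideline in the list above can be sketched briefly: dump the whole `Pipeline` so preprocessing travels with the model, and raw features go straight into the restored object. The file name is illustrative.

```python
# Sketch: persist preprocessing + model together with joblib.
import tempfile
from pathlib import Path

import joblib
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)
pipe = Pipeline([
    ("scale", StandardScaler()),
    ("clf", LogisticRegression(max_iter=1000)),
]).fit(X, y)

path = Path(tempfile.mkdtemp()) / "model_pipeline.joblib"  # hypothetical path
joblib.dump(pipe, path)        # the scaler is saved alongside the classifier

restored = joblib.load(path)
print(restored.predict(X[:3])) # raw, unscaled features go straight in
```

Serializing only the classifier would force the deployment side to re-implement the scaling step, which is exactly the mismatch this guideline avoids.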
Information
- Version
- 1.0.0
- Author
- terminal-skills
- Category
- Data & AI
- License
- Apache-2.0