In the rest of this article, we’ll do a clustering analysis of a food demand time series. You’ll learn how to:
- summarise a set of time series using feature extraction;
- use K-Means and a hierarchical method for time series clustering.
The full code is available on GitHub:
Data set
We’ll use a weekly food sales time series collected by the US Department of Agriculture. This data set contains information about food sales by product category and subcategory. The time series is split by state, but we’ll use the national total sales in each period.
Below is a sample of the data set:
Here’s what the complete data looks like:
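If you want to follow along, the series can be loaded into a pandas DataFrame with one column per food subcategory. Here’s a minimal sketch; the file name and column layout are assumptions, not part of the original code:
import pandas as pd

# hypothetical file: weekly USDA food sales, one column per subcategory,
# indexed by the week's date
data = pd.read_csv('food_sales_weekly.csv',
                   index_col='date',
                   parse_dates=['date'])

data.head()  # sample of the data set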
Feature-based Time Series Clustering
We’ll use a feature-based approach to time series clustering. This process involves two main steps:
- Summarise each time series into a set of features, such as the average value;
- Apply a conventional clustering algorithm to the feature set, such as K-means.
Let’s do each step in turn.
Feature extraction using tsfel
We start by extracting a set of statistics to summarise each time series. The goal is to convert each series into a small set of features.
There are several tools for time series feature extraction. We’ll use tsfel, which offers competitive performance relative to other approaches [3].
Here’s how you can use tsfel:
import pandas as pd
import tsfel

# get the tsfel configuration (all feature domains)
cfg = tsfel.get_features_by_domain()

# extract features for each food subcategory
features = {col: tsfel.time_series_features_extractor(cfg, data[col])
            for col in data}
features_df = pd.concat(features, axis=0)
This process results in a large number of features. Some of these may be redundant, so we carry out a feature selection step.
Below, we apply three operations to the feature set:
- normalization: convert the variables into a 0–1 value range;
- selection by variance: remove any variable with 0 variance;
- selection by correlation: remove any variable with a high correlation with another existing one.
from sklearn.preprocessing import MinMaxScaler
from sklearn.feature_selection import VarianceThreshold
from src.correlation_filter import correlation_filter

# normalizing the features
features_norm_df = pd.DataFrame(MinMaxScaler().fit_transform(features_df),
                                columns=features_df.columns)

# removing features with 0 variance
min_var = VarianceThreshold(threshold=0)
min_var.fit(features_norm_df)
features_norm_df = pd.DataFrame(min_var.transform(features_norm_df),
                                columns=min_var.get_feature_names_out())

# removing correlated features
features_norm_df = correlation_filter(features_norm_df, 0.9)
features_norm_df.index = data.columns
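The correlation_filter function comes from the article’s companion repository. Here’s a minimal sketch of what such a filter could look like (an assumption about its behavior, not the repository’s implementation): a feature is kept only if its absolute correlation with every previously kept feature is below the threshold.
def correlation_filter(df, threshold):
    # sketch: drop features highly correlated with an already kept feature
    corr = df.corr().abs()
    keep = []
    for col in corr.columns:
        if all(corr.loc[col, kept] < threshold for kept in keep):
            keep.append(col)
    return df[keep]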
Clustering with K-Means
After preprocessing the data set, we’re ready to cluster the time series. We summarised each series into a small set of unordered features, so we can use any conventional clustering algorithm. A popular choice is K-means.
With K-means, we need to select the number of clusters we want. Unless we have some domain knowledge, there’s no obvious a priori value for this parameter. Still, we can follow a data-driven approach to select the number of clusters: we test different values and pick the best one.
Below, we test K-means with up to 24 clusters. Then, we pick the number of clusters that maximizes the silhouette score. This metric quantifies the cohesion of the clusters obtained.
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

kmeans_parameters = {
    'init': 'k-means++',
    'n_init': 100,
    'max_iter': 50,
}

n_clusters = range(2, 25)
silhouette_coef = []
for k in n_clusters:
    kmeans = KMeans(n_clusters=k, **kmeans_parameters)
    kmeans.fit(features_norm_df)
    score = silhouette_score(features_norm_df, kmeans.labels_)
    silhouette_coef.append(score)
The silhouette score is maximized for 5 clusters, as shown in the figure below.
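As a small follow-up sketch (reusing the variables from the loop above), we can pick the number of clusters with the highest silhouette score and refit the final model:
import numpy as np

# number of clusters with the highest silhouette score
best_k = n_clusters[int(np.argmax(silhouette_coef))]

# refit K-means with the selected number of clusters
kmeans = KMeans(n_clusters=best_k, **kmeans_parameters)
labels = kmeans.fit_predict(features_norm_df)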
We can draw a parallel coordinates plot to understand the profile of each cluster. Here’s an example with a sample of three features:
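A minimal sketch of how such a plot can be drawn with pandas’ parallel_coordinates is shown below; the three features are picked arbitrarily, and labels is assumed to hold the K-means cluster assignments from the previous step:
import matplotlib.pyplot as plt
from pandas.plotting import parallel_coordinates

# pick three features for illustration
plot_df = features_norm_df.iloc[:, :3].copy()
plot_df['cluster'] = labels  # cluster assignment of each subcategory

parallel_coordinates(plot_df, class_column='cluster', colormap='viridis')
plt.show()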
We can also use the information about the clusters to improve demand forecasting models, for example, by building a model for each cluster. The paper in reference [5] is a good example of this approach.
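A rough sketch of that idea is shown below. This is not the method from reference [5]; the simple auto-regressive setup and the Ridge model are placeholders chosen for illustration:
from sklearn.linear_model import Ridge

# one simple auto-regressive model per cluster, pooling the series in it
models = {}
for cluster_id in set(labels):
    cols = [c for c, lab in zip(data.columns, labels) if lab == cluster_id]
    X, y = [], []
    for col in cols:
        values = data[col].values
        for t in range(3, len(values)):
            X.append(values[t - 3:t])  # last 3 weeks as features
            y.append(values[t])        # next week as target
    models[cluster_id] = Ridge().fit(X, y)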
Hierarchical clustering
Hierarchical clustering is an alternative to K-means. It combines pairs of clusters iteratively, leading to a tree-like structure. The scipy library provides an implementation of this method.
import scipy.cluster.hierarchy as shc

# hierarchical clustering using the Ward method
clustering = shc.linkage(features_norm_df, method='ward')

# plotting the dendrogram
# categories is a pandas Series with the food subcategory names
dend = shc.dendrogram(clustering,
                      labels=categories.values,
                      orientation='right',
                      leaf_font_size=7)
The results of a hierarchical clustering model are best visualized with a dendrogram plot:
We can use the dendrogram to understand the clusters’ profiles. For example, we can see that most canned items are grouped together (the orange color). Oranges also cluster with pancake/cake mixes; these two often go together in people’s breakfasts.
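To go beyond the visual analysis, we can also cut the tree into flat clusters. Below is a minimal sketch using scipy’s fcluster; the number of clusters (5) is an assumed value, not one taken from the article:
from scipy.cluster.hierarchy import fcluster

# cut the dendrogram into 5 flat clusters (assumed value)
hier_labels = fcluster(clustering, t=5, criterion='maxclust')

# map each food subcategory to its hierarchical cluster
hier_assignments = pd.Series(hier_labels, index=categories.values)
print(hier_assignments.sort_values())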