---
title: "Categorical Variables"
description: "Learn how to incorporate external categorical variables in your TimeGPT forecasts to improve accuracy."
icon: "input-text"
---

## What Are Categorical Variables?

Categorical variables are external factors that take on a limited number of discrete values and group observations into categories, such as "Sporting" or "Cultural" events in a dataset describing product demand.

By capturing unique external conditions, categorical variables enhance the predictive power of your model and can reduce forecasting error. They are easy to incorporate by merging each time series data point with its corresponding categorical data.
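As a minimal sketch of that merge step (with made-up series and event data; the column names are illustrative), attaching a categorical column to a time series is a plain pandas join on the series identifier and timestamp:

```python
import pandas as pd

# Hypothetical toy data: a short daily series and a matching event calendar.
y = pd.DataFrame({
    'unique_id': 'item_1',
    'ds': pd.date_range('2024-01-01', periods=4, freq='D'),
    'y': [10.0, 12.0, 30.0, 11.0],
})
events = pd.DataFrame({
    'unique_id': 'item_1',
    'ds': pd.date_range('2024-01-01', periods=4, freq='D'),
    'event_type': ['nan', 'nan', 'Sporting', 'nan'],
})

# merge() joins on the shared columns ('unique_id', 'ds') by default.
df = y.merge(events)
print(df.loc[df['event_type'] == 'Sporting', 'y'].item())  # 30.0
```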

This tutorial demonstrates how to incorporate categorical (discrete) variables into TimeGPT forecasts.

## How to Use Categorical Variables in TimeGPT

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/Nixtla/nixtla/blob/main/nbs/docs/tutorials/03_categorical_variables.ipynb)

### Step 1: Import Packages and Initialize the Nixtla Client

Make sure you have the necessary libraries installed: `pandas`, `nixtla`, and `datasetsforecast`.

```python
import pandas as pd
import os

from nixtla import NixtlaClient
from datasetsforecast.m5 import M5

# Initialize the Nixtla Client
nixtla_client = NixtlaClient(
    api_key='my_api_key_provided_by_nixtla'
)
```

### Step 2: Load M5 Data

We use the **M5 dataset**, a collection of daily product sales from 10 US stores, to showcase how categorical variables can improve forecasts.

Start by loading the M5 dataset and converting the date columns to datetime objects.

```python
Y_df, X_df, _ = M5.load(directory=os.getcwd())

Y_df['ds'] = pd.to_datetime(Y_df['ds'])
X_df['ds'] = pd.to_datetime(X_df['ds'])

Y_df.head(10)
```

| unique_id | ds | y |
|-----------|----|---|
| FOODS_1_001_CA_1 | 2011-01-29 | 3.0 |
| FOODS_1_001_CA_1 | 2011-01-30 | 0.0 |
| FOODS_1_001_CA_1 | 2011-01-31 | 0.0 |
| FOODS_1_001_CA_1 | 2011-02-01 | 1.0 |
| FOODS_1_001_CA_1 | 2011-02-02 | 4.0 |
| FOODS_1_001_CA_1 | 2011-02-03 | 2.0 |
| FOODS_1_001_CA_1 | 2011-02-04 | 0.0 |
| FOODS_1_001_CA_1 | 2011-02-05 | 2.0 |
| FOODS_1_001_CA_1 | 2011-02-06 | 0.0 |
| FOODS_1_001_CA_1 | 2011-02-07 | 0.0 |

Keep only the identifier columns and the categorical column `event_type_1` from the `X_df` dataframe.

```python
X_df = X_df[['unique_id', 'ds', 'event_type_1']]
X_df.head(10)
```

| unique_id | ds | event_type_1 |
|-----------|----|--------------|
| FOODS_1_001_CA_1 | 2011-01-29 | nan |
| FOODS_1_001_CA_1 | 2011-01-30 | nan |
| FOODS_1_001_CA_1 | 2011-01-31 | nan |
| FOODS_1_001_CA_1 | 2011-02-01 | nan |
| FOODS_1_001_CA_1 | 2011-02-02 | nan |
| FOODS_1_001_CA_1 | 2011-02-03 | nan |
| FOODS_1_001_CA_1 | 2011-02-04 | nan |
| FOODS_1_001_CA_1 | 2011-02-05 | nan |
| FOODS_1_001_CA_1 | 2011-02-06 | Sporting |
| FOODS_1_001_CA_1 | 2011-02-07 | nan |

Notice that there is a Sporting event on February 6, 2011, listed under `event_type_1`.

### Step 3: Prepare Data for Forecasting

We'll select a specific product to demonstrate how to incorporate categorical features into TimeGPT forecasts.

#### Select a High-Selling Product and Merge Data

Start by selecting a high-selling product and merging the target and event data (by default, `merge` joins on the shared `unique_id` and `ds` columns).

```python
product = 'FOODS_3_090_CA_3'

Y_df_product = Y_df.query('unique_id == @product')
X_df_product = X_df.query('unique_id == @product')

df = Y_df_product.merge(X_df_product)
df.head(10)
```

| unique_id | ds | y | event_type_1 |
|-----------|----|---|--------------|
| FOODS_3_090_CA_3 | 2011-01-29 | 108.0 | nan |
| FOODS_3_090_CA_3 | 2011-01-30 | 132.0 | nan |
| FOODS_3_090_CA_3 | 2011-01-31 | 102.0 | nan |
| FOODS_3_090_CA_3 | 2011-02-01 | 120.0 | nan |
| FOODS_3_090_CA_3 | 2011-02-02 | 106.0 | nan |
| FOODS_3_090_CA_3 | 2011-02-03 | 123.0 | nan |
| FOODS_3_090_CA_3 | 2011-02-04 | 279.0 | nan |
| FOODS_3_090_CA_3 | 2011-02-05 | 175.0 | nan |
| FOODS_3_090_CA_3 | 2011-02-06 | 186.0 | Sporting |
| FOODS_3_090_CA_3 | 2011-02-07 | 120.0 | nan |

#### One-Hot Encode Categorical Events

Encode categorical variables using one-hot encoding. One-hot encoding transforms each category into a separate column containing binary indicators (0 or 1).

```python
event_type_1_ohe = pd.get_dummies(df['event_type_1'], dtype=int)

df = pd.concat([df, event_type_1_ohe], axis=1)
df = df.drop(columns=['event_type_1'])

df.tail(10)
```

| unique_id | ds | y | Cultural | National | Religious | Sporting | nan |
|-----------|----|---|----------|----------|-----------|-----------|-----|
| FOODS_3_090_CA_3 | 2016-06-10 | 140.0 | 0 | 0 | 0 | 0 | 1 |
| FOODS_3_090_CA_3 | 2016-06-11 | 151.0 | 0 | 0 | 0 | 0 | 1 |
| FOODS_3_090_CA_3 | 2016-06-12 | 87.0 | 0 | 0 | 0 | 0 | 1 |
| FOODS_3_090_CA_3 | 2016-06-13 | 67.0 | 0 | 0 | 0 | 0 | 1 |
| FOODS_3_090_CA_3 | 2016-06-14 | 50.0 | 0 | 0 | 0 | 0 | 1 |
| FOODS_3_090_CA_3 | 2016-06-15 | 58.0 | 0 | 0 | 0 | 0 | 1 |
| FOODS_3_090_CA_3 | 2016-06-16 | 116.0 | 0 | 0 | 0 | 0 | 1 |
| FOODS_3_090_CA_3 | 2016-06-17 | 124.0 | 0 | 0 | 0 | 0 | 1 |
| FOODS_3_090_CA_3 | 2016-06-18 | 167.0 | 0 | 0 | 0 | 0 | 1 |
| FOODS_3_090_CA_3 | 2016-06-19 | 118.0 | 0 | 0 | 0 | 1 | 0 |
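Because each date carries exactly one label, with the literal string `'nan'` acting as the no-event placeholder, every row of the encoded frame should have exactly one indicator set. A quick sanity check on a toy column (values are illustrative):

```python
import pandas as pd

# Toy event column mirroring the structure above; 'nan' here is a
# literal string placeholder for "no event", not a missing value.
events = pd.Series(['nan', 'Sporting', 'nan', 'Cultural'], name='event_type_1')
ohe = pd.get_dummies(events, dtype=int)

# One-hot property: exactly one indicator is 1 per row.
assert (ohe.sum(axis=1) == 1).all()
print(list(ohe.columns))  # ['Cultural', 'Sporting', 'nan']
```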

#### Prepare Future External Variables

Select the future external variables for February 1-7, 2016, matching the 7-day forecast horizon.

```python
future_ex_vars_df = df.drop(columns=['y']).query("ds >= '2016-02-01' & ds <= '2016-02-07'")
```

Separate the training data, keeping everything before February 1, 2016.

```python
df_train = df.query("ds < '2016-02-01'")
df_train.tail(10)
```

| unique_id | ds | y | Cultural | National | Religious | Sporting | nan |
|-----------|----|---|----------|----------|-----------|-----------|-----|
| FOODS_3_090_CA_3 | 2016-01-22 | 94.0 | 0 | 0 | 0 | 0 | 1 |
| FOODS_3_090_CA_3 | 2016-01-23 | 144.0 | 0 | 0 | 0 | 0 | 1 |
| FOODS_3_090_CA_3 | 2016-01-24 | 146.0 | 0 | 0 | 0 | 0 | 1 |
| FOODS_3_090_CA_3 | 2016-01-25 | 87.0 | 0 | 0 | 0 | 0 | 1 |
| FOODS_3_090_CA_3 | 2016-01-26 | 73.0 | 0 | 0 | 0 | 0 | 1 |
| FOODS_3_090_CA_3 | 2016-01-27 | 62.0 | 0 | 0 | 0 | 0 | 1 |
| FOODS_3_090_CA_3 | 2016-01-28 | 64.0 | 0 | 0 | 0 | 0 | 1 |
| FOODS_3_090_CA_3 | 2016-01-29 | 102.0 | 0 | 0 | 0 | 0 | 1 |
| FOODS_3_090_CA_3 | 2016-01-30 | 113.0 | 0 | 0 | 0 | 0 | 1 |
| FOODS_3_090_CA_3 | 2016-01-31 | 98.0 | 0 | 0 | 0 | 0 | 1 |
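These two frames obey a simple invariant: the future frame carries the exogenous columns (but not `y`) for exactly the `h` dates to be forecast, while training data stops the day before. A toy illustration of that date logic (synthetic frame, same query strings as above):

```python
import pandas as pd

# Hypothetical daily frame spanning the split boundary.
dates = pd.date_range('2016-01-25', '2016-02-07', freq='D')
toy = pd.DataFrame({'ds': dates, 'y': 0.0, 'Sporting': 0})

# Future exogenous rows: drop the target, keep the horizon dates.
future_x = toy.drop(columns=['y']).query("ds >= '2016-02-01' & ds <= '2016-02-07'")
# Training rows: everything strictly before the horizon.
train = toy.query("ds < '2016-02-01'")

print(len(future_x))             # 7 -> one row per step of h=7
print(train['ds'].max().date())  # 2016-01-31, the last training day
```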


### Step 4: Forecast Product Demand

To evaluate the impact of categorical variables, we'll forecast product demand with and without them.

#### Forecast Without Categorical Variables

```python
timegpt_fcst_without_cat_vars_df = nixtla_client.forecast(
    df=df_train,
    h=7,
    level=[80, 90]
)

timegpt_fcst_without_cat_vars_df.head()
```

| unique_id | ds | TimeGPT | TimeGPT-lo-90 | TimeGPT-lo-80 | TimeGPT-hi-80 | TimeGPT-hi-90 |
|-----------|----|---------|---------------|---------------|---------------|---------------|
| FOODS_3_090_CA_3 | 2016-02-01 | 73.304092 | 53.449049 | 54.795078 | 91.813107 | 93.159136 |
| FOODS_3_090_CA_3 | 2016-02-02 | 66.335518 | 47.510669 | 50.274136 | 82.396899 | 85.160367 |
| FOODS_3_090_CA_3 | 2016-02-03 | 65.881630 | 36.218617 | 41.388896 | 90.374364 | 95.544643 |
| FOODS_3_090_CA_3 | 2016-02-04 | 72.371864 | -26.683115 | 25.097362 | 119.646367 | 171.426844 |
| FOODS_3_090_CA_3 | 2016-02-05 | 95.141045 | -2.084882 | 34.027078 | 156.255011 | 192.366971 |

Visualize the forecast without categorical variables.

```python
nixtla_client.plot(
    df[['unique_id', 'ds', 'y']].query("ds <= '2016-02-07'"),
    timegpt_fcst_without_cat_vars_df,
    max_insample_length=28,
)
```

<Frame caption="Forecast without categorical variables">
  ![Forecast without categorical variables](https://raw.githubusercontent.com/Nixtla/nixtla/readme_docs/nbs/_docs/docs/tutorials/03_categorical_variables_files/figure-markdown_strict/cell-19-output-1.png)
</Frame>

TimeGPT already provides a reasonable forecast, but it somewhat underforecasts the peak on February 6, 2016, the day before the Super Bowl.

#### Forecast With Categorical Variables

```python
timegpt_fcst_with_cat_vars_df = nixtla_client.forecast(
    df=df_train,
    X_df=future_ex_vars_df,
    h=7,
    level=[80, 90]
)

timegpt_fcst_with_cat_vars_df.head()
```

| unique_id | ds | TimeGPT | TimeGPT-lo-90 | TimeGPT-lo-80 | TimeGPT-hi-80 | TimeGPT-hi-90 |
|-----------|----|---------|---------------|---------------|---------------|---------------|
| FOODS_3_090_CA_3 | 2016-02-01 | 70.661271 | -0.204378 | 14.593348 | 126.729194 | 141.526919 |
| FOODS_3_090_CA_3 | 2016-02-02 | 65.566941 | -20.394326 | 11.654239 | 119.479643 | 151.528208 |
| FOODS_3_090_CA_3 | 2016-02-03 | 68.510010 | -33.713710 | 6.732952 | 130.287069 | 170.733731 |
| FOODS_3_090_CA_3 | 2016-02-04 | 75.417710 | -40.974649 | 4.751767 | 146.083653 | 191.810069 |
| FOODS_3_090_CA_3 | 2016-02-05 | 97.340302 | -57.385361 | 18.253812 | 176.426792 | 252.065965 |

Visualize the forecast with categorical variables.

```python
# Visualize the forecast with categorical variables
nixtla_client.plot(
    df[['unique_id', 'ds', 'y']].query("ds <= '2016-02-07'"),
    timegpt_fcst_with_cat_vars_df,
    max_insample_length=28,
)
```
<Frame caption="Forecast with categorical variables">
  ![Forecast with categorical variables](https://raw.githubusercontent.com/Nixtla/nixtla/readme_docs/nbs/_docs/docs/tutorials/03_categorical_variables_files/figure-markdown_strict/cell-21-output-1.png)
</Frame>

### Step 5: Evaluate Forecast Accuracy

Finally, we calculate the **Mean Absolute Error (MAE)** for the forecasts with and without categorical variables.


```python
from utilsforecast.losses import mae

# Create target dataframe
df_target = df[['unique_id', 'ds', 'y']].query("ds >= '2016-02-01' & ds <= '2016-02-07'")

# Rename forecast columns
timegpt_fcst_without_cat_vars_df = timegpt_fcst_without_cat_vars_df.rename(columns={'TimeGPT': 'TimeGPT-without-cat-vars'})
timegpt_fcst_with_cat_vars_df = timegpt_fcst_with_cat_vars_df.rename(columns={'TimeGPT': 'TimeGPT-with-cat-vars'})

# Merge forecasts with target dataframe
df_target = df_target.merge(timegpt_fcst_without_cat_vars_df[['unique_id', 'ds', 'TimeGPT-without-cat-vars']])
df_target = df_target.merge(timegpt_fcst_with_cat_vars_df[['unique_id', 'ds', 'TimeGPT-with-cat-vars']])

# Compute the mean absolute error for both forecasts
mean_absolute_errors = mae(df_target, ['TimeGPT-without-cat-vars', 'TimeGPT-with-cat-vars'])
mean_absolute_errors
```


| unique_id        | TimeGPT-without-cat-vars | TimeGPT-with-cat-vars |
|------------------|--------------------------|-----------------------|
| FOODS_3_090_CA_3 | 24.285649                | 20.028514             |

Including categorical variables noticeably improves forecast accuracy, reducing the MAE by roughly 17.5%.
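The relative improvement implied by the table works out as follows (values copied from the evaluation table above):

```python
# MAE values from the evaluation table above
mae_without = 24.285649
mae_with = 20.028514

improvement = (mae_without - mae_with) / mae_without
print(f"{improvement:.1%}")  # 17.5%
```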

## Conclusion

Categorical variables are powerful additions to TimeGPT forecasts, helping capture valuable external factors. By properly encoding these variables and merging them with your time series, you can significantly enhance predictive performance.

Continue exploring more advanced techniques or different datasets to further improve your TimeGPT forecasting models.