---
title: "Evaluation Pipeline"
description: "Learn how to evaluate TimeGPT model performance using tools in utilforecast"
icon: "square-root-variable"
---

## Overview

After generating forecasts with TimeGPT, the next step is to evaluate how accurate they are. The `evaluate` function from the `utilsforecast` library provides a fast and flexible way to assess model performance across a wide range of metrics, and it works seamlessly with TimeGPT as well as other forecasting models.

With the evaluation pipeline, you can select which models to evaluate and which metrics to compute, such as MAE, MSE, or MAPE, to benchmark forecasting performance.

## Step-by-Step Guide

### Step 1. Import Required Packages

Start by importing the necessary libraries and initializing the `NixtlaClient` with your API key.

```python
import pandas as pd
from nixtla import NixtlaClient
from functools import partial
from utilsforecast.evaluation import evaluate
from utilsforecast.losses import mae, mse, rmse, mape, smape, mase, scaled_crps

nixtla_client = NixtlaClient(api_key='your_api_key_here')
```
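
Optionally, you can confirm the client is set up correctly before making any forecast calls. A minimal check, assuming a recent version of the `nixtla` client that exposes `validate_api_key`:

```python
# Returns True if the API key is accepted by the Nixtla API
nixtla_client.validate_api_key()
```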

### Step 2. Load and Prepare the Dataset

For this example, we use the Air Passenger dataset, which records monthly totals of international airline passengers. We'll load the dataset, format the timestamps, and split the data into a training set and a test set. The last 12 months are used for testing.

```python
# Load the dataset, add a series identifier, and parse the timestamps
df = pd.read_csv('https://raw.githubusercontent.com/Nixtla/transfer-learning-time-series/main/datasets/air_passengers.csv')
df['unique_id'] = 'passengers'
df['timestamp'] = pd.to_datetime(df['timestamp'])
```

```python
# Hold out the last 12 months as the test set
df_train = df.iloc[:-12]
df_test = df.iloc[-12:]
```
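
Before forecasting, it is worth confirming that the split looks as expected. A quick sanity check, assuming the `df_train` and `df_test` frames from the cell above:

```python
# The classic Air Passengers series has 144 monthly observations,
# so we expect 132 rows for training and 12 for testing.
print(df_train.shape, df_test.shape)

# The test set should start right after the training set ends
print(df_train['timestamp'].max(), df_test['timestamp'].min())
```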

### Step 3. Generate Forecast with TimeGPT

Next, we will:

- Use the training set to generate a 12-step forecast with TimeGPT.
- Merge the forecast with the test set for evaluation.

```python
fcst_timegpt = nixtla_client.forecast(
    df=df_train,
    h=12,
    time_col='timestamp',
    target_col='value',
    level=[80, 95],
)
fcst_timegpt = fcst_timegpt.merge(df_test, on=['timestamp', 'unique_id'])
```
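
Before computing metrics, it can help to inspect the merged frame. The point forecast lives in the `TimeGPT` column, and each requested level adds interval columns (for example `TimeGPT-lo-80` and `TimeGPT-hi-80`). A quick look, assuming the merge above succeeded:

```python
# Expect 12 rows containing the point forecast, interval bounds, and actuals
print(fcst_timegpt.columns.tolist())
print(fcst_timegpt.head())
```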

### Step 4. Define Models and Metrics for Evaluation

Next, we define the models to evaluate and the metrics to use. For more information about supported metrics, refer to the [evaluation metrics tutorial](forecasting/evaluation/evaluation_metrics).

```python
models = ['TimeGPT']
metrics = [
    mae,
    mse,
    rmse,
    mape,
    smape,
    partial(mase, seasonality=12),
    scaled_crps,
]
```
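
Note the `partial` wrapper around `mase`: `evaluate` calls every metric with the same arguments and does not forward metric-specific parameters, so `mase`'s required `seasonality` must be bound in advance. A small sketch of what the binding does (12 matches the monthly seasonal period of this dataset):

```python
from functools import partial
from utilsforecast.losses import mase

# Fix seasonality=12 so the wrapped metric can be called like the other losses
mase_monthly = partial(mase, seasonality=12)
print(mase_monthly.keywords)  # {'seasonality': 12}
```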

### Step 5. Run the Evaluation

Finally, call the `evaluate` function with the merged forecast results. Pass `train_df` for metrics that require the training data (such as MASE) and `level` when using probabilistic metrics (such as scaled CRPS).

```python
evaluation = evaluate(
    fcst_timegpt,
    metrics=metrics,
    models=models,
    target_col='value',
    time_col='timestamp',
    train_df=df_train,
    level=[80, 95],
)
```

The output is a long-format DataFrame with one row per metric:

| unique_id  | metric      | TimeGPT  |
| ---------- | ----------- | -------- |
| passengers | mae         | 12.67930 |
| passengers | mse         | 213.9358 |
| passengers | rmse        | 14.62654 |
| passengers | mape        | 0.026964 |
| passengers | smape       | 0.013527 |
| passengers | mase        | 0.416397 |
| passengers | scaled_crps | 0.008991 |
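
If you prefer one row per series with metrics as columns, the long-format results can be reshaped with pandas. A minimal sketch, assuming the `evaluation` DataFrame produced above:

```python
# Pivot to one row per series and one column per metric
summary = evaluation.pivot(index='unique_id', columns='metric', values='TimeGPT')
print(summary.round(4))
```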