---
title: "Time Series Forecasting with Dask"
description: "Scale pandas workflows with Dask and TimeGPT for distributed time series forecasting. Learn to process 10M+ time series in Python with minimal code changes."
icon: "server"
---

## Overview

[Dask](https://www.dask.org/) is an open-source parallel computing library for Python that scales pandas workflows seamlessly. This guide explains how to use TimeGPT from Nixtla with Dask for distributed forecasting tasks.

Dask is ideal when you're already using pandas and need to scale beyond single-machine memory limits—typically for datasets with 10-100 million observations across multiple time series. Unlike Spark, Dask requires minimal code changes from your existing pandas workflow.

## Why Use Dask for Time Series Forecasting?

Dask offers unique advantages for scaling time series forecasting:

- **Pandas-like API**: Minimal code changes from your existing pandas workflows
- **Easy scaling**: Convert pandas DataFrames to Dask with a single line of code
- **Python-native**: Pure Python implementation, no JVM required (unlike Spark)
- **Flexible deployment**: Run on your laptop or scale to a cluster
- **Memory efficiency**: Process datasets larger than RAM through intelligent chunking

Choose Dask when you need to scale from 10 million to 100 million observations and want the smoothest transition from pandas.

**What you'll learn:**

- Simplify distributed computing with Fugue
- Run TimeGPT at scale on a Dask cluster
- Seamlessly convert pandas DataFrames to Dask

## Prerequisites

Before proceeding, make sure you have an [API key from Nixtla](/setup/setting_up_your_api_key).
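
If you prefer to keep the key out of your source code, the client can also read it from the `NIXTLA_API_KEY` environment variable. A minimal sketch (replace the placeholder with your own key):

```python
import os

# NixtlaClient falls back to this variable when no api_key argument is passed.
os.environ["NIXTLA_API_KEY"] = "my_api_key_provided_by_nixtla"
```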

## How to Use TimeGPT with Dask

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/Nixtla/nixtla/blob/main/nbs/docs/tutorials/17_computing_at_scale_dask_distributed.ipynb)

### Step 1: Install Fugue and Dask

Fugue provides an easy-to-use interface for distributed computing over frameworks like Dask.

You can install Fugue with:

```bash
pip install "fugue[dask]"
```

The quotes around `fugue[dask]` prevent some shells (such as zsh) from interpreting the brackets as a glob pattern.

If running on a distributed Dask cluster, ensure the `nixtla` library is installed on all worker nodes.
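
Dask falls back to a local scheduler by default, so no extra setup is needed to follow this tutorial on a single machine. If you do want an explicit cluster, here is a minimal sketch; the scheduler address in the comment is a placeholder to adapt to your deployment:

```python
from dask.distributed import Client

# Local cluster for experimentation; for a remote cluster you would
# instead pass its scheduler address, e.g. Client("tcp://scheduler:8786").
client = Client(n_workers=2, threads_per_worker=2)
print(client.dashboard_link)  # link to Dask's diagnostics dashboard
```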

### Step 2: Load Your Data

You can start by loading data into a pandas DataFrame. In this example, we use hourly electricity prices from multiple markets:

```python
import pandas as pd

df = pd.read_csv(
    'https://raw.githubusercontent.com/Nixtla/transfer-learning-time-series/main/datasets/electricity-short.csv',
    parse_dates=['ds'],
)
df.head()
```

Example pandas DataFrame:

|   | unique_id | ds                  | y     |
| - | --------- | ------------------- | ----- |
| 0 | BE        | 2016-10-22 00:00:00 | 70.00 |
| 1 | BE        | 2016-10-22 01:00:00 | 37.10 |
| 2 | BE        | 2016-10-22 02:00:00 | 37.10 |
| 3 | BE        | 2016-10-22 03:00:00 | 44.75 |
| 4 | BE        | 2016-10-22 04:00:00 | 37.10 |

### Step 3: Convert to a Dask DataFrame

Convert the pandas DataFrame into a Dask DataFrame for parallel processing.

```python
import dask.dataframe as dd

dask_df = dd.from_pandas(df, npartitions=2)
dask_df
```

When converting to a Dask DataFrame, you can specify the number of partitions based on your data size or system resources.
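
A common heuristic, sketched below, is to create roughly one partition per CPU core so tasks spread evenly across workers. This is a general Dask guideline, not a Nixtla requirement:

```python
import multiprocessing

import dask.dataframe as dd

# One partition per core is a reasonable starting point for data that
# fits comfortably in memory; increase npartitions for larger datasets.
dask_df = dd.from_pandas(df, npartitions=multiprocessing.cpu_count())
```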

### Step 4: Use TimeGPT on Dask

To use TimeGPT with Dask, pass a Dask DataFrame to Nixtla's client methods instead of a pandas DataFrame.

Instantiate the `NixtlaClient` class to interact with Nixtla's API:

```python
from nixtla import NixtlaClient

nixtla_client = NixtlaClient(
    api_key='my_api_key_provided_by_nixtla'
)
```

You can use any method from the `NixtlaClient`, such as `forecast` or `cross_validation`.

<Tabs>
  <Tab title="Forecast Example">
    ```python Forecast with TimeGPT and Dask
    fcst_df = nixtla_client.forecast(dask_df, h=12)
    fcst_df.compute().head()
    ```
    |   | unique_id | ds                  | TimeGPT   |
    | - | --------- | ------------------- | --------- |
    | 0 | BE        | 2016-12-31 00:00:00 | 45.190453 |
    | 1 | BE        | 2016-12-31 01:00:00 | 43.244446 |
    | 2 | BE        | 2016-12-31 02:00:00 | 41.958389 |
    | 3 | BE        | 2016-12-31 03:00:00 | 39.796486 |
    | 4 | BE        | 2016-12-31 04:00:00 | 39.204533 |
  </Tab>
  <Tab title="Cross-validation Example">
    ```python Cross-validation with TimeGPT and Dask
    cv_df = nixtla_client.cross_validation(
        dask_df,
        h=12,
        n_windows=5,
        step_size=2
    )
    cv_df.compute().head()
    ```
    |   | unique_id | ds                  | cutoff              | TimeGPT   |
    | - | --------- | ------------------- | ------------------- | --------- |
    | 0 | BE        | 2016-12-30 04:00:00 | 2016-12-30 03:00:00 | 39.375439 |
    | 1 | BE        | 2016-12-30 05:00:00 | 2016-12-30 03:00:00 | 40.039215 |
    | 2 | BE        | 2016-12-30 06:00:00 | 2016-12-30 03:00:00 | 43.455849 |
    | 3 | BE        | 2016-12-30 07:00:00 | 2016-12-30 03:00:00 | 47.716408 |
    | 4 | BE        | 2016-12-30 08:00:00 | 2016-12-30 03:00:00 | 50.316650 |
  </Tab>
</Tabs>
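
Note that both methods return a lazy Dask DataFrame, so nothing is materialized until you call `.compute()`. Filtering before computing keeps the transferred result small; for example, assuming `fcst_df` from the forecast tab above:

```python
# Standard Dask behavior: select only the series you need, then materialize.
be_fcst = fcst_df[fcst_df["unique_id"] == "BE"].compute()
be_fcst.head()
```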

## Working with Exogenous Variables

TimeGPT with Dask also supports exogenous variables. Refer to the [Exogenous Variables Tutorial](/forecasting/exogenous-variables/numeric_features) for details. Simply substitute pandas DataFrames with Dask DataFrames—the API remains identical.
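
As a hypothetical sketch, assuming `dask_df` carries historical exogenous columns and `future_ex_vars_dd` is a Dask DataFrame with the same columns over the forecast horizon:

```python
# X_df supplies the future values of the exogenous variables;
# future_ex_vars_dd is illustrative and must match your own columns.
fcst_ex_df = nixtla_client.forecast(
    df=dask_df,
    X_df=future_ex_vars_dd,
    h=12,
)
fcst_ex_df.compute().head()
```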

## Related Resources

Explore more distributed forecasting options:
- [Distributed Computing Overview](/forecasting/forecasting-at-scale/computing_at_scale) - Compare Spark, Dask, and Ray
- [Spark Integration](/forecasting/forecasting-at-scale/spark) - For datasets with 100M+ observations
- [Ray Integration](/forecasting/forecasting-at-scale/ray) - For ML pipeline integration
- [Fine-tuning TimeGPT](/forecasting/fine-tuning/steps) - Improve accuracy at scale
- [Cross-Validation](/forecasting/evaluation/cross_validation) - Validate distributed forecasts