---
title: "Time Series Forecasting with Spark"
description: "Scale enterprise time series forecasting with Spark and TimeGPT. Learn to process 100M+ observations across distributed clusters with Python and PySpark."
icon: "server"
---

## Overview

[Spark](https://spark.apache.org/) is an open-source distributed computing framework designed for large-scale data processing. This guide demonstrates how to use TimeGPT with Spark to perform forecasting and cross-validation across distributed clusters.

Spark is ideal for enterprise environments with existing Hadoop infrastructure and datasets exceeding 100 million observations. Its robust distributed architecture handles massive-scale time series forecasting with fault tolerance and high performance.

## Why Use Spark for Time Series Forecasting?

Spark offers unique advantages for enterprise-scale time series forecasting:

- **Enterprise-grade scalability**: Handle datasets with 100M+ observations across distributed clusters
- **Hadoop integration**: Seamlessly integrate with existing HDFS and Hadoop ecosystems
- **Fault tolerance**: Automatic recovery from node failures ensures reliable computation
- **Mature ecosystem**: Leverage Spark's rich ecosystem of tools and libraries
- **Multi-language support**: Work with Python (PySpark), Scala, or Java

Choose Spark when you have enterprise infrastructure or datasets exceeding 100 million observations, or when you need robust fault tolerance for mission-critical forecasting.

**What you'll learn:**

- Install Fugue with Spark support for distributed computing
- Convert pandas DataFrames to Spark DataFrames
- Run TimeGPT forecasting and cross-validation on Spark clusters

## Prerequisites

Before proceeding, make sure you have an [API key from Nixtla](/setup/setting_up_your_api_key).

If executing on a distributed Spark cluster, ensure the `nixtla` library is installed on all worker nodes for consistent execution.
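
How you do this depends on your cluster. One approach, sketched here for a plain `spark-submit` deployment, is to pack a conda environment and ship it to the workers with `--archives`, following the pattern from the PySpark documentation (`forecast_job.py` is a placeholder for your application script):

```bash
# Build and pack an environment that includes nixtla and fugue.
conda create -y -n forecast_env -c conda-forge python=3.10 conda-pack
conda activate forecast_env
pip install "fugue[spark]" nixtla
conda pack -f -o forecast_env.tar.gz

# Ship the packed environment to the workers.
spark-submit \
    --archives forecast_env.tar.gz#environment \
    --conf spark.pyspark.python=./environment/bin/python \
    forecast_job.py
```

Managed platforms such as Databricks or EMR provide their own library-installation mechanisms; use those where available.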

## How to Use TimeGPT with Spark

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/Nixtla/nixtla/blob/main/nbs/docs/tutorials/16_computing_at_scale_spark_distributed.ipynb)

### Step 1: Install Fugue and Spark

Fugue provides a convenient interface to distribute Python code across frameworks like Spark.

Install Fugue with Spark support:

```bash
pip install fugue[spark]
```

To work with TimeGPT, make sure you have the `nixtla` library installed as well.
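
If it is not installed yet:

```bash
pip install nixtla
```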

### Step 2: Load Your Data

Load the dataset into a pandas DataFrame. In this example, we use hourly electricity price data from different markets:

```python
import pandas as pd

df = pd.read_csv(
    'https://raw.githubusercontent.com/Nixtla/transfer-learning-time-series/main/datasets/electricity-short.csv',
    parse_dates=['ds'],
)
df.head()
```

### Step 3: Initialize Spark

Create a Spark session and convert your pandas DataFrame to a Spark DataFrame:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

spark_df = spark.createDataFrame(df)
spark_df.show(5)
```
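
The client distributes the forecasting work across series. As an optional tuning sketch (not required, and the partition count below is arbitrary), you can repartition the DataFrame on the series identifier so each partition holds complete series:

```python
# Optional: repartition by the series identifier (`unique_id` in this
# dataset) so complete series land in the same partition. Tune the
# partition count to your cluster; 100 here is only illustrative.
spark_df = spark_df.repartition(100, "unique_id")
```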

### Step 4: Use TimeGPT on Spark

To use TimeGPT with Spark, pass a Spark DataFrame to Nixtla's client methods instead of a pandas DataFrame. This is the only change from local usage: the methods also return Spark DataFrames, so the rest of your pipeline stays distributed.

Instantiate the `NixtlaClient` class to interact with Nixtla's API:

```python
from nixtla import NixtlaClient

nixtla_client = NixtlaClient(
    api_key='my_api_key_provided_by_nixtla'
)
```
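
Alternatively, the client reads the `NIXTLA_API_KEY` environment variable, which keeps credentials out of your code:

```python
import os

from nixtla import NixtlaClient

# Assumes NIXTLA_API_KEY was exported before launching the job,
# e.g. `export NIXTLA_API_KEY=...`.
assert "NIXTLA_API_KEY" in os.environ
nixtla_client = NixtlaClient()  # picks up the key automatically
```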

You can use any method from the `NixtlaClient`, such as `forecast` or `cross_validation`.

<Tabs>
  <Tab title="Forecast Example">
    ```python
    fcst_df = nixtla_client.forecast(spark_df, h=12)
    fcst_df.show(5)
    ```

    When using Azure AI endpoints, specify `model="azureai"`:

    ```python
    nixtla_client.forecast(
        spark_df,
        h=12,
        model="azureai"
    )
    ```

    The public API supports two models: `timegpt-1` (default) and `timegpt-1-long-horizon`. For long-horizon forecasting, see the [long-horizon model tutorial](/forecasting/model-version/longhorizon_model); a minimal example of selecting it follows.
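
    A minimal sketch (the horizon value here is only illustrative):

    ```python
    fcst_long_df = nixtla_client.forecast(
        spark_df,
        h=36,  # longer horizon, for illustration
        model="timegpt-1-long-horizon",
    )
    fcst_long_df.show(5)
    ```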
  </Tab>
  <Tab title="Cross-validation Example">
    ```python
    cv_df = nixtla_client.cross_validation(
        spark_df,
        h=12,
        n_windows=5,
        step_size=2
    )
    cv_df.show(5)
    ```
  </Tab>
</Tabs>
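
Both methods return Spark DataFrames. If a result is small enough to fit in driver memory, you can collect it back to pandas for plotting or local analysis:

```python
# Collect the distributed forecast to the driver as a pandas DataFrame.
# Only do this when the result comfortably fits in driver memory.
fcst_pdf = fcst_df.toPandas()
fcst_pdf.head()
```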

### Step 5: Stop Spark

After completing your tasks, stop the Spark session to free resources:

```python
spark.stop()
```

## Working with Exogenous Variables

TimeGPT with Spark also supports exogenous variables. Refer to the [Exogenous Variables Tutorial](/forecasting/exogenous-variables/numeric_features) for details. Simply substitute pandas DataFrames with Spark DataFrames; the API remains identical, as the sketch below shows.
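
A minimal sketch, assuming `spark_df` also contains historical exogenous columns and that `futr_spark_df` (hypothetical here) is a Spark DataFrame with their future values over the horizon:

```python
# Sketch: forecasting with exogenous variables on Spark.
# `futr_spark_df` must contain `unique_id` and `ds` for every step of
# the horizon, plus the future values of each exogenous column.
fcst_exog_df = nixtla_client.forecast(
    df=spark_df,
    X_df=futr_spark_df,
    h=24,
)
fcst_exog_df.show(5)
```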

## Related Resources

Explore more distributed forecasting options:
- [Distributed Computing Overview](/forecasting/forecasting-at-scale/computing_at_scale) - Compare Spark, Dask, and Ray
- [Dask Integration](/forecasting/forecasting-at-scale/dask) - For datasets with 10M-100M observations
- [Ray Integration](/forecasting/forecasting-at-scale/ray) - For ML pipeline integration
- [Fine-tuning TimeGPT](/forecasting/fine-tuning/steps) - Improve accuracy at scale
- [Cross-Validation](/forecasting/evaluation/cross_validation) - Validate distributed forecasts