---
title: "Time Series Forecasting with Spark"
description: "Scale enterprise time series forecasting with Spark and TimeGPT. Learn to process 100M+ observations across distributed clusters with Python and PySpark."
icon: "server"
---

## Overview

[Spark](https://spark.apache.org/) is an open-source distributed computing framework designed for large-scale data processing. This guide demonstrates how to use TimeGPT with Spark to perform forecasting and cross-validation across distributed clusters.

Spark is ideal for enterprise environments with existing Hadoop infrastructure and datasets exceeding 100 million observations. Its robust distributed architecture handles massive-scale time series forecasting with fault tolerance and high performance.

## Why Use Spark for Time Series Forecasting?

Spark offers unique advantages for enterprise-scale time series forecasting:

- **Enterprise-grade scalability**: Handle datasets with 100M+ observations across distributed clusters
- **Hadoop integration**: Seamlessly integrate with existing HDFS and Hadoop ecosystems
- **Fault tolerance**: Automatic recovery from node failures ensures reliable computation
- **Mature ecosystem**: Leverage Spark's rich ecosystem of tools and libraries
- **Multi-language support**: Work with Python (PySpark), Scala, or Java

Choose Spark when you have enterprise infrastructure or datasets exceeding 100 million observations, or when you need robust fault tolerance for mission-critical forecasting.

**What you'll learn:**

- Install Fugue with Spark support for distributed computing
- Convert pandas DataFrames to Spark DataFrames
- Run TimeGPT forecasting and cross-validation on Spark clusters

## Prerequisites

Before proceeding, make sure you have an [API key from Nixtla](/setup/setting_up_your_api_key).

If executing on a distributed Spark cluster, ensure the `nixtla` library is installed on all worker nodes for consistent execution.
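
How you do this depends on your cluster. One approach, sketched here for a plain `spark-submit` deployment, is to pack a conda environment and ship it to the workers with `--archives`, following the pattern from the PySpark documentation (`forecast_job.py` is a placeholder for your application script):

```bash
# Build and pack an environment that includes nixtla and fugue.
conda create -y -n forecast_env -c conda-forge python=3.10 conda-pack
conda activate forecast_env
pip install "fugue[spark]" nixtla
conda pack -f -o forecast_env.tar.gz

# Ship the packed environment to the workers.
spark-submit \
    --archives forecast_env.tar.gz#environment \
    --conf spark.pyspark.python=./environment/bin/python \
    forecast_job.py
```

Managed platforms such as Databricks or EMR provide their own library-installation mechanisms; use those where available.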

## How to Use TimeGPT with Spark

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/Nixtla/nixtla/blob/main/nbs/docs/tutorials/16_computing_at_scale_spark_distributed.ipynb)

### Step 1: Install Fugue and Spark

Fugue provides a convenient interface to distribute Python code across frameworks like Spark.

Install Fugue with Spark support:

```bash
pip install fugue[spark]
```

To work with TimeGPT, make sure you have the `nixtla` library installed as well.
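
If it is not installed yet:

```bash
pip install nixtla
```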

### Step 2: Load Your Data

Load the dataset into a pandas DataFrame. In this example, we use hourly electricity price data from different markets:

```python
import pandas as pd

df = pd.read_csv(
    'https://raw.githubusercontent.com/Nixtla/transfer-learning-time-series/main/datasets/electricity-short.csv',
    parse_dates=['ds'],
)
df.head()
```

### Step 3: Initialize Spark

Create a Spark session and convert your pandas DataFrame to a Spark DataFrame:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

spark_df = spark.createDataFrame(df)
spark_df.show(5)
```
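
The client distributes the forecasting work across series. As an optional tuning sketch (not required, and the partition count below is arbitrary), you can repartition the DataFrame on the series identifier so each partition holds complete series:

```python
# Optional: repartition by the series identifier (`unique_id` in this
# dataset) so complete series land in the same partition. Tune the
# partition count to your cluster; 100 here is only illustrative.
spark_df = spark_df.repartition(100, "unique_id")
```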

### Step 4: Use TimeGPT on Spark

To use TimeGPT with Spark, pass a Spark DataFrame to Nixtla's client methods instead of a pandas DataFrame. This is the only change from local usage: the methods also return Spark DataFrames, so the rest of your pipeline stays distributed.

Instantiate the `NixtlaClient` class to interact with Nixtla's API:

```python
from nixtla import NixtlaClient

nixtla_client = NixtlaClient(
    api_key='my_api_key_provided_by_nixtla'
)
```
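
Alternatively, the client reads the `NIXTLA_API_KEY` environment variable, which keeps credentials out of your code:

```python
import os

from nixtla import NixtlaClient

# Assumes NIXTLA_API_KEY was exported before launching the job,
# e.g. `export NIXTLA_API_KEY=...`.
assert "NIXTLA_API_KEY" in os.environ
nixtla_client = NixtlaClient()  # picks up the key automatically
```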

You can use any method from the `NixtlaClient`, such as `forecast` or `cross_validation`.

<Tabs>
  <Tab title="Forecast Example">
    ```python
    fcst_df = nixtla_client.forecast(spark_df, h=12)
    fcst_df.show(5)
    ```

    When using Azure AI endpoints, specify `model="azureai"`:

    ```python
    nixtla_client.forecast(
        spark_df,
        h=12,
        model="azureai"
    )
    ```

    The public API supports two models: `timegpt-1` (default) and `timegpt-1-long-horizon`. For long-horizon forecasting, see the [long-horizon model tutorial](/forecasting/model-version/longhorizon_model); a minimal example of selecting it follows.
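
    A minimal sketch (the horizon value here is only illustrative):

    ```python
    fcst_long_df = nixtla_client.forecast(
        spark_df,
        h=36,  # longer horizon, for illustration
        model="timegpt-1-long-horizon",
    )
    fcst_long_df.show(5)
    ```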
  </Tab>
  <Tab title="Cross-validation Example">
    ```python
    cv_df = nixtla_client.cross_validation(
        spark_df,
        h=12,
        n_windows=5,
        step_size=2
    )
    cv_df.show(5)
    ```
  </Tab>
</Tabs>
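
Both methods return Spark DataFrames. If a result is small enough to fit in driver memory, you can collect it back to pandas for plotting or local analysis:

```python
# Collect the distributed forecast to the driver as a pandas DataFrame.
# Only do this when the result comfortably fits in driver memory.
fcst_pdf = fcst_df.toPandas()
fcst_pdf.head()
```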

### Step 5: Stop Spark

After completing your tasks, stop the Spark session to free resources:

```python
spark.stop()
```

## Working with Exogenous Variables

TimeGPT with Spark also supports exogenous variables. Refer to the [Exogenous Variables Tutorial](/forecasting/exogenous-variables/numeric_features) for details. Simply substitute pandas DataFrames with Spark DataFrames; the API remains identical, as the sketch below shows.
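
A minimal sketch, assuming `spark_df` also contains historical exogenous columns and that `futr_spark_df` (hypothetical here) is a Spark DataFrame with their future values over the horizon:

```python
# Sketch: forecasting with exogenous variables on Spark.
# `futr_spark_df` must contain `unique_id` and `ds` for every step of
# the horizon, plus the future values of each exogenous column.
fcst_exog_df = nixtla_client.forecast(
    df=spark_df,
    X_df=futr_spark_df,
    h=24,
)
fcst_exog_df.show(5)
```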

## Related Resources

Explore more distributed forecasting options:
- [Distributed Computing Overview](/forecasting/forecasting-at-scale/computing_at_scale) - Compare Spark, Dask, and Ray
- [Dask Integration](/forecasting/forecasting-at-scale/dask) - For datasets with 10M-100M observations
- [Ray Integration](/forecasting/forecasting-at-scale/ray) - For ML pipeline integration
- [Fine-tuning TimeGPT](/forecasting/fine-tuning/steps) - Improve accuracy at scale
- [Cross-Validation](/forecasting/evaluation/cross_validation) - Validate distributed forecasts