---
title: "Time Series Forecasting with Spark"
description: "Scale enterprise time series forecasting with Spark and TimeGPT. Learn to process 100M+ observations across distributed clusters with Python and PySpark."
icon: "server"
---

## Overview

[Spark](https://spark.apache.org/) is an open-source distributed computing framework designed for large-scale data processing. This guide demonstrates how to use TimeGPT with Spark to perform forecasting and cross-validation across distributed clusters.

Spark is ideal for enterprise environments with existing Hadoop infrastructure and datasets exceeding 100 million observations. Its robust distributed architecture handles massive-scale time series forecasting with fault tolerance and high performance.

## Why Use Spark for Time Series Forecasting?

Spark offers several advantages for enterprise-scale time series forecasting:

- **Enterprise-grade scalability**: Handle datasets with 100M+ observations across distributed clusters
- **Hadoop integration**: Integrate seamlessly with existing HDFS and Hadoop ecosystems
- **Fault tolerance**: Automatic recovery from node failures ensures reliable computation
- **Mature ecosystem**: Leverage Spark's rich ecosystem of tools and libraries
- **Multi-language support**: Work with Python (PySpark), Scala, or Java

Choose Spark when you have enterprise infrastructure, datasets exceeding 100 million observations, or mission-critical forecasting that requires robust fault tolerance.

**What you'll learn:**

- Install Fugue with Spark support for distributed computing
- Convert pandas DataFrames to Spark DataFrames
- Run TimeGPT forecasting and cross-validation on Spark clusters

## Prerequisites

Before proceeding, make sure you have an [API key from Nixtla](/setup/setting_up_your_api_key).

If executing on a distributed Spark cluster, ensure the `nixtla` library is installed on all worker nodes so every executor can run the client code consistently.
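As a hedged sketch of that worker-node setup (the exact provisioning mechanism depends on your cluster manager, e.g. a bootstrap or init script; the package names below are the only assumption carried over from this guide):

```shell
# Run on every node; how you fan this out (bootstrap script, image build,
# configuration management) varies by cluster manager. This is a sketch,
# not a universal recipe.
pip install nixtla "fugue[spark]"

# The Nixtla client can also read the API key from this environment
# variable, as an alternative to passing api_key= explicitly.
export NIXTLA_API_KEY="your_api_key"
```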
## How to Use TimeGPT with Spark

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/Nixtla/nixtla/blob/main/nbs/docs/tutorials/16_computing_at_scale_spark_distributed.ipynb)

### Step 1: Install Fugue and Spark

Fugue provides a convenient interface to distribute Python code across frameworks like Spark. Install Fugue with Spark support (the quotes prevent your shell from expanding the brackets):

```bash
pip install "fugue[spark]"
```

To work with TimeGPT, make sure you have the `nixtla` library installed as well.

### Step 2: Load Your Data

Load the dataset into a pandas DataFrame. In this example, we use hourly electricity price data from different markets:

```python
import pandas as pd

df = pd.read_csv(
    'https://raw.githubusercontent.com/Nixtla/transfer-learning-time-series/main/datasets/electricity-short.csv',
    parse_dates=['ds'],
)
df.head()
```

### Step 3: Initialize Spark

Create a Spark session and convert your pandas DataFrame to a Spark DataFrame:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
spark_df = spark.createDataFrame(df)
spark_df.show(5)
```

### Step 4: Use TimeGPT on Spark

To use TimeGPT with Spark, pass a Spark DataFrame to Nixtla's client methods instead of a pandas DataFrame; everything else works the same as in local usage.

Instantiate the `NixtlaClient` class to interact with Nixtla's API:

```python
from nixtla import NixtlaClient

nixtla_client = NixtlaClient(
    api_key='my_api_key_provided_by_nixtla'
)
```

You can then use any method from the `NixtlaClient`, such as `forecast` or `cross_validation`:

```python
fcst_df = nixtla_client.forecast(spark_df, h=12)
fcst_df.show(5)
```

When using Azure AI endpoints, specify `model="azureai"`:

```python
nixtla_client.forecast(
    spark_df,
    h=12,
    model="azureai"
)
```

The public API supports two models: `timegpt-1` (default) and `timegpt-1-long-horizon`.
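Conceptually, distributed forecasting partitions the data by series identifier and forecasts each group independently. A minimal pandas-only sketch of that per-group pattern (with a hypothetical last-value forecaster standing in for TimeGPT, so it runs without an API key or a cluster):

```python
import pandas as pd

def naive_forecast(group: pd.DataFrame, h: int) -> pd.DataFrame:
    # Hypothetical stand-in for TimeGPT: repeat the last observed value h times.
    last_ds = group["ds"].max()
    future_ds = pd.date_range(last_ds, periods=h + 1, freq="h")[1:]
    return pd.DataFrame({
        "unique_id": group["unique_id"].iloc[0],
        "ds": future_ds,
        "yhat": group["y"].iloc[-1],
    })

# Toy data in the same long format TimeGPT expects: unique_id, ds, y.
df = pd.DataFrame({
    "unique_id": ["A"] * 3 + ["B"] * 3,
    "ds": pd.date_range("2024-01-01", periods=3, freq="h").tolist() * 2,
    "y": [1.0, 2.0, 3.0, 10.0, 20.0, 30.0],
})

# Forecast each series independently, as a distributed engine would do per partition.
fcst = pd.concat(
    [naive_forecast(g, h=2) for _, g in df.groupby("unique_id")],
    ignore_index=True,
)
print(fcst)
```

This is only an illustration of the partition-by-series idea; on Spark, Fugue handles the grouping and scheduling for you when you pass `spark_df` to the client.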
For long-horizon forecasting, see the [long-horizon model tutorial](/forecasting/model-version/longhorizon_model).

Cross-validation follows the same pattern:

```python
cv_df = nixtla_client.cross_validation(
    spark_df,
    h=12,
    n_windows=5,
    step_size=2
)
cv_df.show(5)
```

### Step 5: Stop Spark

After completing your tasks, stop the Spark session to free resources:

```python
spark.stop()
```

## Working with Exogenous Variables

TimeGPT with Spark also supports exogenous variables. Refer to the [Exogenous Variables Tutorial](/forecasting/exogenous-variables/numeric_features) for details. Simply substitute pandas DataFrames with Spark DataFrames; the API remains identical.

## Related Resources

Explore more distributed forecasting options:

- [Distributed Computing Overview](/forecasting/forecasting-at-scale/computing_at_scale) - Compare Spark, Dask, and Ray
- [Dask Integration](/forecasting/forecasting-at-scale/dask) - For datasets with 10M-100M observations
- [Ray Integration](/forecasting/forecasting-at-scale/ray) - For ML pipeline integration
- [Fine-tuning TimeGPT](/forecasting/fine-tuning/steps) - Improve accuracy at scale
- [Cross-Validation](/forecasting/evaluation/cross_validation) - Validate distributed forecasts
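As a closing illustration, the rolling-origin windows behind `cross_validation` (its `h`, `n_windows`, and `step_size` parameters) can be sketched in plain pandas, with a hypothetical last-value forecaster standing in for TimeGPT:

```python
import pandas as pd

def last_value_forecast(train: pd.Series, h: int) -> list:
    # Hypothetical stand-in for TimeGPT.
    return [train.iloc[-1]] * h

y = pd.Series(range(20), dtype=float)  # toy series
h, n_windows, step_size = 3, 2, 2

results = []
for w in range(n_windows):
    # Each window's cutoff advances by step_size; the last window's
    # validation slice ends at the end of the series.
    cutoff = len(y) - h - (n_windows - 1 - w) * step_size
    train, valid = y.iloc[:cutoff], y.iloc[cutoff:cutoff + h]
    results.append({
        "cutoff": cutoff,
        "preds": last_value_forecast(train, h),
        "actuals": valid.tolist(),
    })

for r in results:
    print(r)
```

The real `cross_validation` call returns one forecast-versus-actual row per window and timestamp per series; the sketch above only shows how the training cutoffs slide.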