Complete Workflow Example

This example demonstrates the typical PyIndexNum workflow for calculating economic indices from raw price data.

Sample Data

Let’s start with some sample price and quantity data for three products, each observed five times between January and March 2023:

import polars as pl
import pyindexnum as pin

# Create sample data
data = {
    "date": [
        "2023-01-01", "2023-01-15", "2023-02-01", "2023-02-15", "2023-03-01",
        "2023-01-01", "2023-01-15", "2023-02-01", "2023-02-15", "2023-03-01",
        "2023-01-01", "2023-01-15", "2023-02-01", "2023-02-15", "2023-03-01"
    ],
    "product": ["A", "A", "A", "A", "A", "B", "B", "B", "B", "B", "C", "C", "C", "C", "C"],
    "price": [100, 105, 110, 108, 115, 200, 195, 210, 205, 220, 50, 52, 48, 51, 49],
    "quantity": [10, 12, 11, 13, 10, 5, 6, 4, 7, 5, 20, 18, 22, 19, 21]
}

df = pl.DataFrame(data)
print(df)

Output:

shape: (15, 4)
┌────────────┬─────────┬───────┬──────────┐
│ date       ┆ product ┆ price ┆ quantity │
│ ---        ┆ ---     ┆ ---   ┆ ---      │
│ str        ┆ str     ┆ i64   ┆ i64      │
╞════════════╪═════════╪═══════╪══════════╡
│ 2023-01-01 ┆ A       ┆ 100   ┆ 10       │
│ 2023-01-15 ┆ A       ┆ 105   ┆ 12       │
│ 2023-02-01 ┆ A       ┆ 110   ┆ 11       │
│ 2023-02-15 ┆ A       ┆ 108   ┆ 13       │
│ 2023-03-01 ┆ A       ┆ 115   ┆ 10       │
│ ...        ┆ ...     ┆ ...   ┆ ...      │
│ 2023-03-01 ┆ C       ┆ 49    ┆ 21       │
└────────────┴─────────┴───────┴──────────┘

Step 1: Standardize Column Names

First, standardize the column names to match PyIndexNum’s expected format:

df_std = pin.standardize_columns(
    df,
    date_col="date",
    price_col="price",
    id_col="product",
    quantity_col="quantity"
)
print(df_std.head())

Output:

shape: (5, 4)
┌────────────┬───────┬────────────┬──────────┐
│ date       ┆ price ┆ product_id ┆ quantity │
│ ---        ┆ ---   ┆ ---        ┆ ---      │
│ date       ┆ i64   ┆ str        ┆ i64      │
╞════════════╪═══════╪════════════╪══════════╡
│ 2023-01-01 ┆ 100   ┆ A          ┆ 10       │
│ 2023-01-15 ┆ 105   ┆ A          ┆ 12       │
│ 2023-02-01 ┆ 110   ┆ A          ┆ 11       │
│ 2023-02-15 ┆ 108   ┆ A          ┆ 13       │
│ 2023-03-01 ┆ 115   ┆ A          ┆ 10       │
└────────────┴───────┴────────────┴──────────┘

Step 2: Aggregate Time Series

Aggregate the twice-monthly observations to monthly periods. In the output, each period is labelled by the first day of its month:

df_agg = pin.aggregate_time(
    df_std,
    freq="1mo",
    agg_type="arithmetic"  # Arithmetic mean for prices
)
print(df_agg)

Output:

shape: (9, 4)
┌────────────┬────────────┬───────────────────┬─────────────────────┐
│ product_id ┆ period     ┆ aggregated_price  ┆ aggregated_quantity │
│ ---        ┆ ---        ┆ ---               ┆ ---                 │
│ str        ┆ date       ┆ f64               ┆ f64                 │
╞════════════╪════════════╪═══════════════════╪═════════════════════╡
│ A          ┆ 2023-01-01 ┆ 102.5             ┆ 11.0                │
│ A          ┆ 2023-02-01 ┆ 109.0             ┆ 12.0                │
│ A          ┆ 2023-03-01 ┆ 115.0             ┆ 10.0                │
│ B          ┆ 2023-01-01 ┆ 197.5             ┆ 5.5                 │
│ B          ┆ 2023-02-01 ┆ 207.5             ┆ 5.5                 │
│ B          ┆ 2023-03-01 ┆ 220.0             ┆ 5.0                 │
│ C          ┆ 2023-01-01 ┆ 51.0              ┆ 19.0                │
│ C          ┆ 2023-02-01 ┆ 49.5              ┆ 20.5                │
│ C          ┆ 2023-03-01 ┆ 49.0              ┆ 21.0                │
└────────────┴────────────┴───────────────────┴─────────────────────┘
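To make the aggregation concrete, the sketch below reproduces the three rows for product A using only the Python standard library. It illustrates the arithmetic (a plain mean of prices and of quantities within each calendar month) and is not PyIndexNum's implementation; the variable names are chosen for this example.

```python
from collections import defaultdict
from statistics import mean

# (date, price, quantity) rows for product A, copied from the sample data
rows = [
    ("2023-01-01", 100, 10),
    ("2023-01-15", 105, 12),
    ("2023-02-01", 110, 11),
    ("2023-02-15", 108, 13),
    ("2023-03-01", 115, 10),
]

# Group observations by calendar month, using "YYYY-MM" as the key
by_month = defaultdict(list)
for date, price, quantity in rows:
    by_month[date[:7]].append((price, quantity))

# Arithmetic mean of price and of quantity within each month
monthly = {
    month: (mean(p for p, _ in obs), mean(q for _, q in obs))
    for month, obs in by_month.items()
}

for month, (avg_price, avg_quantity) in sorted(monthly.items()):
    print(month, float(avg_price), float(avg_quantity))
# 2023-01 102.5 11.0
# 2023-02 109.0 12.0
# 2023-03 115.0 10.0
```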

Step 3: Handle Unbalanced Data (Optional)

In this example the panel is already balanced: every product has an observation in every period. If some products were missing from some periods, you could either remove the unbalanced products:

df_balanced = pin.remove_unbalanced(df_agg)

Or impute missing values:

df_imputed = pin.carry_forward_imputation(
    df_agg,
    value_cols=["aggregated_price", "aggregated_quantity"]
)
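The two options can be sketched in plain Python as follows. The data here is hypothetical (product B is given no February observation), and how PyIndexNum treats edge cases such as a gap in the first period is an assumption of this sketch, not something the example above specifies.

```python
# Hypothetical monthly prices; product "B" has no February observation
periods = ["2023-01", "2023-02", "2023-03"]
prices = {
    "A": {"2023-01": 102.5, "2023-02": 109.0, "2023-03": 115.0},
    "B": {"2023-01": 197.5, "2023-03": 220.0},
}

# Option 1: keep only products observed in every period
balanced = {
    pid: obs for pid, obs in prices.items()
    if all(period in obs for period in periods)
}

# Option 2: fill each gap with the most recent earlier observation
imputed = {}
for pid, obs in prices.items():
    filled, last = {}, None
    for period in periods:
        last = obs.get(period, last)  # fall back to the previous value
        if last is not None:          # a leading gap has nothing to carry
            filled[period] = last
    imputed[pid] = filled

print(sorted(balanced))         # ['A']
print(imputed["B"]["2023-02"])  # 197.5
```

Removal keeps the comparison strictly like-for-like at the cost of dropping products, while carry-forward keeps them but assumes the price did not move during the gap.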

Step 4: Calculate Bilateral Indices

Calculate indices comparing two specific periods. Let’s compare January to February:

from datetime import date

# Filter to January and February data.
# Plain datetime.date values are used because is_in() expects literal values.
df_two_periods = df_agg.filter(
    pl.col("period").is_in([date(2023, 1, 1), date(2023, 2, 1)])
)

# Calculate various bilateral indices
laspeyres_idx = pin.laspeyres(df_two_periods)
paasche_idx = pin.paasche(df_two_periods)
fisher_idx = pin.fisher(df_two_periods)
tornqvist_idx = pin.tornqvist(df_two_periods)

print(f"Laspeyres Index: {laspeyres_idx:.4f}")
print(f"Paasche Index: {paasche_idx:.4f}")
print(f"Fisher Index: {fisher_idx:.4f}")
print(f"Törnqvist Index: {tornqvist_idx:.4f}")

Output:

Laspeyres Index: 1.0308
Paasche Index: 1.0304
Fisher Index: 1.0306
Törnqvist Index: 1.0306
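For reference, the textbook definitions of these four indices can be written out in a few lines of plain Python. The sketch below applies them to the January and February averages from Step 2. It assumes PyIndexNum follows the standard formulas, which this example does not confirm, so treat it as a cross-check rather than a description of the library's internals.

```python
from math import exp, log, sqrt

# Monthly averages from Step 2: product -> (price, quantity)
jan = {"A": (102.5, 11.0), "B": (197.5, 5.5), "C": (51.0, 19.0)}
feb = {"A": (109.0, 12.0), "B": (207.5, 5.5), "C": (49.5, 20.5)}

def basket_cost(prices, quantities):
    """Cost of buying the `quantities` basket at `prices`."""
    return sum(prices[k][0] * quantities[k][1] for k in prices)

# Laspeyres prices the base-period (January) basket in both periods
laspeyres = basket_cost(feb, jan) / basket_cost(jan, jan)
# Paasche prices the current-period (February) basket in both periods
paasche = basket_cost(feb, feb) / basket_cost(jan, feb)
# Fisher is the geometric mean of Laspeyres and Paasche
fisher = sqrt(laspeyres * paasche)

def shares(period):
    """Expenditure share of each product within one period."""
    total = basket_cost(period, period)
    return {k: p * q / total for k, (p, q) in period.items()}

# Törnqvist weights log price relatives by the average expenditure share
s0, s1 = shares(jan), shares(feb)
tornqvist = exp(sum(
    0.5 * (s0[k] + s1[k]) * log(feb[k][0] / jan[k][0]) for k in jan
))

for name, value in [("Laspeyres", laspeyres), ("Paasche", paasche),
                    ("Fisher", fisher), ("Törnqvist", tornqvist)]:
    print(f"{name}: {value:.4f}")
```

By construction, Fisher always lies between Laspeyres and Paasche, which is a useful sanity check on any set of results.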

Step 5: Calculate Multilateral Indices (Optional)

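For data spanning more than two periods, multilateral methods use all periods at once, so the resulting index is transitive and does not depend on the choice of base period: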

# GEKS-Fisher index for all three months
geks_fisher_idx = pin.geks_fisher(df_agg)
print(f"GEKS-Fisher Index (Jan-Mar): {geks_fisher_idx:.4f}")
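As a rough illustration of what a GEKS-Fisher calculation involves, the sketch below implements the textbook definition in plain Python: the index between two periods is the geometric mean of Fisher comparisons routed through every period in the window. The function names and the per-period output are choices made for this example; the return shape and internals of `pin.geks_fisher` are not specified here and may differ.

```python
from math import prod, sqrt

# Monthly averages from Step 2: period -> product -> (price, quantity)
panel = {
    "2023-01": {"A": (102.5, 11.0), "B": (197.5, 5.5), "C": (51.0, 19.0)},
    "2023-02": {"A": (109.0, 12.0), "B": (207.5, 5.5), "C": (49.5, 20.5)},
    "2023-03": {"A": (115.0, 10.0), "B": (220.0, 5.0), "C": (49.0, 21.0)},
}

def cost(price_period, basket_period):
    """Cost of one period's basket valued at another period's prices."""
    return sum(
        panel[price_period][k][0] * panel[basket_period][k][1]
        for k in panel[basket_period]
    )

def fisher(base, current):
    """Bilateral Fisher index from `base` to `current`."""
    laspeyres = cost(current, base) / cost(base, base)
    paasche = cost(current, current) / cost(base, current)
    return sqrt(laspeyres * paasche)

def geks_fisher(base, current):
    """Geometric mean of Fisher comparisons routed through every link period."""
    links = list(panel)
    return prod(
        fisher(base, link) * fisher(link, current) for link in links
    ) ** (1 / len(links))

# Index level of each month relative to January
for period in panel:
    print(period, round(geks_fisher("2023-01", period), 4))
```

Because every comparison is averaged over all link periods, chaining January to February to March gives the same result as comparing January with March directly.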

GEKS-Jevons (Unweighted)

GEKS-Jevons is an unweighted multilateral index, so it does not require quantity data:

# GEKS-Jevons index (no quantity column needed)
geks_jevons_idx = pin.geks_jevons(df_agg)
print(f"GEKS-Jevons Index (Jan-Mar): {geks_jevons_idx:.4f}")
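The unweighted case can be sketched the same way. Here the bilateral building block is the Jevons index (a geometric mean of price relatives over products present in both periods), and only the prices from Step 2 are used. As before, this follows the textbook definition and is an assumption about, not a copy of, what `pin.geks_jevons` does.

```python
from math import prod

# Monthly average prices from Step 2: period -> product -> price
prices = {
    "2023-01": {"A": 102.5, "B": 197.5, "C": 51.0},
    "2023-02": {"A": 109.0, "B": 207.5, "C": 49.5},
    "2023-03": {"A": 115.0, "B": 220.0, "C": 49.0},
}

def jevons(base, current):
    """Unweighted geometric mean of price relatives over matched products."""
    matched = prices[base].keys() & prices[current].keys()
    return prod(
        prices[current][k] / prices[base][k] for k in matched
    ) ** (1 / len(matched))

def geks_jevons(base, current):
    """Geometric mean of Jevons comparisons routed through every link period."""
    links = list(prices)
    return prod(
        jevons(base, link) * jevons(link, current) for link in links
    ) ** (1 / len(links))

# Index level of each month relative to January
for period in prices:
    print(period, round(geks_jevons("2023-01", period), 4))
```

On a balanced panel such as this one, the Jevons index is already transitive, so GEKS-Jevons coincides with the direct Jevons index. The two only diverge when products enter or leave the sample.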

Step 6: Apply Extension Methods (Optional)

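For chained multilateral indices, splicing methods link the results of successive estimation windows into one continuous series:

# Example of movement splicing for extending the index.
# geks_fisher_idx1 and geks_fisher_idx2 are placeholders, presumably GEKS-Fisher
# results from two overlapping windows; they are not computed earlier in this example.
extended_indices = pin.movement_splice(geks_fisher_idx1, geks_fisher_idx2)
print("Extended indices:", extended_indices)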
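The idea behind a movement splice can be shown with a few lines of plain Python. In the usual textbook formulation, when the window rolls forward by one period, the already published series is extended by the new window's change between its last two periods. The helper name `splice_by_movement` and the index levels below are hypothetical and invented for illustration; the inputs and return value of `pin.movement_splice` may be structured differently.

```python
def splice_by_movement(published, new_window):
    """Extend `published` by the last period-on-period movement of `new_window`."""
    movement = new_window[-1] / new_window[-2]
    return published + [published[-1] * movement]

# Hypothetical index levels, invented for illustration only
window_1 = [1.000, 1.031, 1.058]  # first window, e.g. Jan-Mar
window_2 = [1.000, 1.027, 1.040]  # window rolled forward one month, e.g. Feb-Apr

extended = splice_by_movement(window_1, window_2)
print([round(value, 4) for value in extended])
```

Earlier values are never revised: only the newest period is appended, which is what makes splicing suitable for series that are published as they are produced.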

Summary

This workflow demonstrates:

  1. Data Preparation: Standardizing columns and aggregating time series

  2. Data Quality: Handling unbalanced panels through removal or imputation

  3. Index Calculation: Computing bilateral indices for period-to-period comparisons

  4. Advanced Methods: Using multilateral indices for multi-period analysis

  5. Extensions: Applying splicing methods for chained multilateral indices

The choice of specific methods depends on your data characteristics and analytical requirements. Always consider the economic interpretation of different index formulas when selecting appropriate methods for your use case.