
The DNA of Data: Parquet, Arrow, and the Quest for Analytic Speed

By Gopi Krishna Tummala


TL;DR: The Golden Rule

Store in Parquet, compute in Arrow, convert to tensors only when needed.

This simple principle is the foundation of modern high-performance data pipelines. Parquet excels at storage efficiency and I/O optimization, while Arrow excels at in-memory computation and zero-copy operations. Together, they form the backbone of systems that can handle petabytes of data efficiently.

Key Takeaways:

  • Parquet = Optimized for disk storage (best compression, column pruning, data skipping)
  • Arrow = Optimized for in-memory compute (zero-copy, SIMD-ready, language-agnostic)
  • The Pattern: Store data in Parquet → Convert to Arrow once → Use Arrow for all operations → Convert to tensors only at the final step
  • Real-World Impact: This approach lets systems like HuggingFace Datasets handle hundreds of gigabytes on a laptop, and the same pattern scales to petabyte-scale training pipelines

Introduction: The Library Analogy

Welcome to the cutting edge of data systems. While algorithms and hardware often take center stage, the unsung hero—or villain—of modern data processing is the way we structure data itself.

Imagine you have a massive library (your dataset). If the books are organized randomly, finding a single sentence takes forever. If they are perfectly shelved by topic (columnar storage), but the pages are written in a foreign language (unoptimized format), you still hit a wall.

This is the problem that formats like Apache Parquet and Apache Arrow solve. They are complementary standards for organizing the “pages” and “shelves” of our massive data libraries to maximize analytic performance.

The Golden Rule: Store in Parquet, compute in Arrow, convert to tensors only when needed. This principle guides every high-performance data pipeline in production today.


1. The Core Duo: Disk vs. Memory 💾/🧠

The fundamental trade-off in modern analytics is the split between data at rest (on disk) and data in motion (in memory).

A. Apache Parquet: The Master of the Disk 🏆 (Data at Rest)

Parquet is a columnar storage format optimized for space efficiency and minimizing I/O operations. Think of Parquet as an archivist meticulously compressing and indexing data on permanent storage (like S3 or HDFS).

How it works (The Columnar Advantage):

Instead of storing a full row (e.g., all details for a single customer) together, Parquet stores all values for a single column (e.g., all customer names) together.

Benefit: Queries often only read a few columns. Parquet allows the system to skip reading irrelevant columns entirely (column pruning).

Key Feature: Smart Encoding & Skipping:

Parquet employs highly effective encoding techniques like:

  • Dictionary Encoding (DICT): Replaces repeated values with dictionary indices
  • Run-Length Encoding (RLE): Compresses sequences of identical values
  • Bit-Packed Encoding (BP): Efficiently stores small integer values

Crucially, it stores statistics like min/max values (zone maps) in its metadata for each row group (Parquet’s fundamental unit for parallelism and filtering), allowing the engine to skip entire row groups (data skipping) when filtering.

Trade-off: Parquet is space-efficient, achieving the best overall compression ratio. A compression ratio of 0.13 means the file size is only 13% of the original uncompressed data size—excellent for storage. However, getting the data ready for processing requires CPU-intensive decoding and decompression of these row groups.

Example:

Row-oriented (CSV-like):
Row 1: [Name: "Alice", Age: 30, City: "NYC", Salary: 100000]
Row 2: [Name: "Bob", Age: 25, City: "SF", Salary: 120000]

Columnar (Parquet):
Column Name: ["Alice", "Bob"]
Column Age: [30, 25]
Column City: ["NYC", "SF"]
Column Salary: [100000, 120000]

When querying for “average salary”, Parquet only reads the Salary column, skipping Name, Age, and City entirely.
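
A minimal pyarrow sketch of this behavior (assuming a local file named data.parquet with the columns above; the filter threshold is arbitrary):

import pyarrow.parquet as pq

# Column pruning: only the Salary column is read from disk
salaries = pq.read_table("data.parquet", columns=["Salary"])

# Predicate pushdown: row groups whose min/max statistics exclude
# Age > 28 are skipped before any decoding happens
high_earners = pq.read_table(
    "data.parquet",
    columns=["Salary"],
    filters=[("Age", ">", 28)],
)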

B. Apache Arrow: The Speed Demon of RAM ⚡ (Data in Motion)

Arrow is a language-agnostic, in-memory columnar data format optimized for blazing-fast computation. If Parquet is the archivist, Arrow is the zero-overhead desk organizer for the CPU.

How it works (Zero-Copy):

Arrow defines a standardized memory layout that is ready for modern CPU architectures (like SIMD, or vectorization) and GPUs. Since all components of a query engine can understand this exact memory layout, data can be passed between different systems (Python/Pandas, Spark, C++, R) without costly serialization and deserialization (the zero-copy concept).

The Arrow vs. Database Dilemma (Querying):

Arrow is designed for high-speed interoperability. It supports logical encodings such as Dictionary Encoding, but by default it stores numeric data as unencoded, native memory types (no RLE or bit-packing at the physical level). Because the data is largely unencoded, its compression ratio is poor: roughly 1.07 by default, meaning the file is about 107% of the original size, i.e. effectively no compression. The perceived difficulty of "querying" Arrow stems from this plain-memory layout; traditional database systems prefer data that is already encoded, both to save space and to allow predicates to run directly in the encoded domain.

Trade-off: Arrow offers the best decompression/transcoding throughput (speed) but has the worst compression ratio without explicit encoding (size).

Zero-Copy Example:

# Traditional approach (expensive):
pandas_df → serialize to JSON → network transfer → deserialize → numpy array
# Cost: Serialization + Network + Deserialization

# Arrow approach (zero-copy):
pandas_df → Arrow format → network transfer → Arrow format → numpy array
# Cost: Only network transfer (same memory layout everywhere)
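
In real code, the hand-off looks roughly like this (a small sketch; it assumes a numeric column with no nulls, which is when pyarrow can expose the buffer to NumPy without copying):

import pandas as pd
import pyarrow as pa

df = pd.DataFrame({"name": ["Alice", "Bob"], "salary": [100000, 120000]})

# pandas → Arrow: columnar buffers, no row-by-row serialization
table = pa.Table.from_pandas(df)

# Arrow → NumPy: the salary buffer is shared, not copied
salary = table.column("salary").chunk(0).to_numpy(zero_copy_only=True)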

2. The Trade-Offs: Performance vs. Storage Depth ⚖️

The core tension: you cannot have maximum compression and maximum speed simultaneously. This fundamental trade-off has been systematically evaluated in recent research [1], which provides the empirical foundation for understanding these format characteristics.

| Dimension | Parquet (On-Disk Optimized) | Arrow (In-Memory Optimized) | Key Learning |
| --- | --- | --- | --- |
| Primary Goal | Minimize disk size and I/O | Maximize in-memory compute speed | Different formats for different stages |
| Compression Ratio | Best (0.13 CR = 13% of original size) | Worst (1.07 CR = 107% of original size) | Compression is essential for I/O efficiency [1] |
| Transcoding Speed | Slower due to intensive decoding (especially for Zstd/zlib) | Fastest (zero-copy read capability) | Decoding cost is the price Parquet pays for compression |
| Data Skipping | Fine-grained (row group level) | Only chunk-level, requiring full chunk reading | Skipping only necessary data is critical for low-selectivity queries |
| "Point Query" Access | Best for very low selectivity (finding a few records) because it decodes only needed row groups | Worst by default, as it needs to load entire row batches | Format choice depends on query pattern |

The Opportunity: A Unified Co-Design

Recent research by Liu et al. [1] demonstrates that the systems of the future will need to co-design a unified in-memory and on-disk representation. This means:

Parquet needs an in-memory companion:

Leveraging Parquet’s encoded data directly in memory, avoiding the expensive conversion to Arrow when possible. An augmented Parquet variant showed up to 100x speedup over the streaming baseline by adding an in-memory representation, direct query, and SIMD.

Key Innovation:

  • Keep data in encoded format in memory
  • Enable direct querying on encoded dictionaries
  • Use SIMD operations for vectorized processing
  • Avoid decompression until absolutely necessary
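
The sketch below shows what "direct querying on encoded data" means in its simplest form, using plain NumPy; the dictionary and codes arrays are hypothetical stand-ins for a dictionary-encoded column:

import numpy as np

# Dictionary-encoded city column: unique strings stored once,
# each row stored as a small integer code
dictionary = np.array(["NYC", "SF", "LA"])
codes = np.array([0, 1, 0, 2, 1, 0], dtype=np.int8)

# Evaluate city == "SF" against the tiny dictionary, then compare codes.
# No row is ever decoded back to a string.
matching = np.where(dictionary == "SF")[0]
row_mask = np.isin(codes, matching)
row_ids = np.nonzero(row_mask)[0]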

Arrow needs more encoding:

Implementing optimizations like Direct Querying (running predicates on the encoded dictionary) on Arrow files, which can lead to significant speedups (2x to 4x) by skipping the data decoding step.

Key Innovation:

  • Add dictionary encoding support to Arrow
  • Enable predicate pushdown on encoded data
  • Maintain zero-copy benefits while gaining compression
  • Bridge the gap between storage and compute efficiency
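
Arrow already exposes dictionary encoding as a logical type; a small sketch of the building blocks (the column values are illustrative):

import pyarrow as pa

cities = pa.array(["NYC", "SF", "LA"] * 1000)
encoded = cities.dictionary_encode()        # pyarrow.DictionaryArray

print(encoded.dictionary)                   # ["NYC", "SF", "LA"] stored once
print(encoded.indices)                      # small integer codes, one per row
print(cities.nbytes, "->", encoded.nbytes)  # repetitive data shrinks noticeably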

3. Large-Scale Data Loading and Inference Pipelines 🤖

The true value of this design lies in distributed systems and machine learning (ML) where data movement is the primary bottleneck.

A. The Parquet ➡️ Arrow Pipeline (Loading)

In large-scale data loaders (like Apache Spark, Dask, or dlt), the combined format strategy is key:

1. Storage (Parquet):

The dataset is stored as Parquet files in a data lake for its compression and efficient columnar pruning capabilities. This minimizes cloud storage costs and I/O time.

Benefits:

  • Reduced storage costs (10x compression typical)
  • Faster data transfer from object storage
  • Column pruning reduces I/O by 90%+ for selective queries
  • Zone maps enable data skipping at file level

2. Ingestion/Transformation (Arrow):

When a data loader (e.g., PySpark) needs the data, it reads the Parquet file and converts it into an Arrow Table in memory.

The Conversion Process:

Parquet File (Disk)
  ↓ (Read + Decompress + Decode)
Arrow Table (Memory)
  ↓ (Zero-Copy Operations)
Pandas DataFrame / NumPy Array / Spark DataFrame

3. Zero-Copy Execution:

The data remains in the Arrow format while moving between Python (Pandas/NumPy) and the underlying C++/Java execution engines (like Spark or Ray). This zero-copy interoperability eliminates the serialization/deserialization bottleneck at every step of ETL/feature engineering.

Real-World Example:

Tools like dlt serialize data directly into Parquet from Arrow and deserialize directly back to Arrow, avoiding Python row-by-row processing and leveraging fast C++ libraries like pyarrow.

# Traditional ETL (slow):
for row in csv_reader:
    process_row(row)  # Python loop, row-by-row
    write_to_parquet(row)

# Arrow-based ETL (fast):
import pyarrow.csv as pv
import pyarrow.parquet as pq

arrow_table = pv.read_csv("input.csv")          # Bulk, multi-threaded read
arrow_table = transform(arrow_table)            # Vectorized operations
pq.write_table(arrow_table, "output.parquet")   # Bulk write

Performance Gains:

  • 10-100x faster than row-by-row processing
  • Memory efficient (no intermediate copies)
  • Leverages SIMD for vectorized operations

B. Arrow and Distributed Inference (Inference)

For large-scale ML inference pipelines, Arrow provides two crucial advantages for distributed computing:

1. Distributed Data Transfer (Arrow Flight):

In a cluster (like Ray or Spark) where models run across many nodes, data must be transferred quickly. Arrow Flight is an RPC framework specifically designed to stream large amounts of Arrow data across the network with minimal overhead. This is essential for low-latency, real-time analytics and distributed ML.

How Arrow Flight Works:

Client                    Server
  |                         |
  |-- Arrow Flight Request->|
  |                         |
  |<-- Arrow Batch Stream --|
  |<-- Arrow Batch Stream --|
  |<-- Arrow Batch Stream --|

Key Features:

  • Streaming: Data flows as it’s processed, not all at once
  • Zero-Copy: Network protocol uses Arrow’s memory layout directly
  • Parallel: Multiple streams can run concurrently
  • Low Latency: Optimized for real-time scenarios

Use Cases:

  • Real-time feature serving
  • Distributed model inference
  • Stream processing pipelines
  • Federated learning data transfer
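
A minimal client-side sketch with pyarrow.flight; the endpoint, the ticket contents, and process() are placeholders for whatever a real Flight server exposes:

import pyarrow.flight as flight

# Connect to a hypothetical Flight server
client = flight.connect("grpc://feature-server:8815")

# Request a stream identified by an opaque, server-defined ticket
reader = client.do_get(flight.Ticket(b"features/latest"))

# Record batches arrive already in Arrow layout; no row-wise deserialization
for chunk in reader:
    batch = chunk.data   # pyarrow.RecordBatch
    process(batch)       # placeholder for downstream work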

2. GPU Acceleration (RAPIDS):

Arrow’s standardized columnar memory format is optimized for SIMD and is the core format for GPU-accelerated libraries like RAPIDS cuDF. This allows data to be moved directly to the GPU for feature engineering and inference without expensive conversions, maximizing hardware utilization.

The GPU Pipeline:

CPU Memory (Arrow)
  ↓ (Direct Memory Transfer)
GPU Memory (cuDF)
  ↓ (GPU Processing)
GPU Results (cuDF)
  ↓ (Direct Memory Transfer)
CPU Memory (Arrow)

Benefits:

  • No CPU-GPU Conversion: Arrow format is GPU-ready
  • SIMD Optimization: Vectorized operations on GPU
  • Memory Efficiency: Direct memory mapping
  • High Throughput: Process billions of rows per second
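
A small sketch of the CPU-to-GPU hand-off (assumes a CUDA-capable GPU with the RAPIDS cudf package installed; the file and column names are illustrative):

import pyarrow.parquet as pq
import cudf  # RAPIDS

# Arrow table on the CPU, read from Parquet
arrow_table = pq.read_table("features.parquet")

# Columnar buffers move to the GPU without a row-wise conversion
gdf = cudf.DataFrame.from_arrow(arrow_table)

# GPU-side feature engineering, then back to Arrow for the next stage
gdf["ctr"] = gdf["clicks"] / gdf["impressions"]
result = gdf.to_arrow()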

ML Framework Integration:

ML frameworks like TensorFlow I/O and PyTorch also leverage Arrow for faster data ingestion into the training/inference loops.

Example: TensorFlow with Arrow

import tensorflow_io as tfio

# Read Parquet directly into TensorFlow
dataset = tfio.IODataset.from_parquet("data.parquet")
# Arrow format is used internally for zero-copy transfer

Example: PyTorch with Arrow

import torch
import pyarrow.parquet as pq

# Read Parquet to Arrow
table = pq.read_table("data.parquet")
# Convert to PyTorch tensor with minimal overhead
tensor = torch.from_numpy(table.column("features").to_numpy())

4. How Parquet and Arrow Coexist in Modern ML Dataloaders 🚀

Modern ML pipelines rarely choose Parquet or Arrow — they usually use both, because each solves a different problem.

A high-performance data stack typically looks like this:

  • Parquet → efficient storage
  • Arrow → efficient in-memory compute
  • DuckDB / HF Datasets / Ray / Spark → engines that convert between them on the fly

This “storage vs memory” layering is exactly what gives today’s ML workloads their speed.

🧱 Why Parquet + Arrow is the Winning Combination

Parquet = Optimized Storage Format:

  • Columnar on disk
  • Compressed (Snappy, ZSTD)
  • Row-group layout enables scanning only required portions
  • Ideal for data lakes, S3, long-term storage
  • Cheap to store, cheap to transfer

Arrow = Optimized In-Memory Compute:

  • Zero-copy buffers
  • SIMD/vectorized execution
  • Shared memory between languages (Python/C++/Rust)
  • No decoding overhead
  • Acts like NumPy for tabular data

🔑 Together:

Store as Parquet → load only what you need → operate in Arrow → pass to the model/dataloader with zero-copy.

This is the core of fast ML data systems like DuckDB, HuggingFace Datasets, Ray Data, Polars, etc.

📘 Real-World Case Study: HuggingFace Datasets (HF)

HuggingFace Datasets is the most widely used ML dataloader today that combines Parquet + Arrow cleanly. Understanding how HF works internally reveals the true power of the Parquet → Arrow pipeline.

🔹 The Core Idea: Parquet vs Arrow

Apache Parquet:

  • Optimized on-disk storage format
  • Columnar, compressed, encoded
  • Very cheap to store, scan, filter
  • Not memory-mapped directly — must be decoded first
  • Think: cold format (disk → CPU decoding → memory)

Apache Arrow:

  • In-memory columnar format
  • Zero-copy slicing
  • Random access is near-instant
  • Interoperates with GPUs, Pandas, DuckDB, Polars
  • Think: hot format (RAM or memmap)

🔹 How HF Stores Data Internally

HF stores datasets as:

dataset/
  - data-00000-of-00002.arrow
  - data-00001-of-00002.arrow
  - state.json
  - metadata.json

Even if the original source was Parquet.

Why Arrow IPC files instead of Parquet?

  • Arrow IPC (Feather) supports memory-mapping
  • Parquet does not (requires decoding)
  • Arrow supports zero-copy slicing, enabling random access by index
  • HF needs fast __getitem__(i) for training

🔹 The Two-Phase Lifecycle: Prepare Time vs Load Time

🧠 Prepare Time (Dataset Creation / Preprocessing Step)

This happens once when creating or preprocessing the dataset:

  1. Read Parquet using pyarrow / fsspec / dataset streaming
  2. Decode Parquet pages → Arrow record batches
  3. Apply transformations (casting, tokenization if batched)
  4. Write Arrow IPC shards: data-xxxxx.arrow
  5. Build Arrow index metadata

Cost: Expensive
🔧 Why: Parquet decoding + Arrow reshaping + writing shards

After this, the dataset is cached and extremely fast to reload.

The internal pipeline (simplified):

User loads Parquet → HF reads batches → Converts to Arrow record batches 
→ Saves arrow file → mmap on read → zero-copy column access

⚡ Load Time (During Training / Dataloader Epochs)

This happens repeatedly during training:

  1. Memory-map Arrow files: mmap("data-00000.arrow")
  2. Dataloader calls dataset[i]
  3. HF slices the needed columns with zero-copy Arrow slicing
  4. Converts to a Python dict of NumPy arrays / torch tensors for your framework
  5. Yields the training batch

Cost: Extremely cheap
🔧 Why: No decoding, no I/O except page faults
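
Roughly what the load-time path does, expressed directly in pyarrow (the shard name and row index are illustrative; HF writes its shards with the Arrow IPC stream writer, so open_stream is used here, while pa.ipc.open_file is the counterpart for IPC files):

import pyarrow as pa

# Map the shard into virtual memory; nothing is read until pages are touched
source = pa.memory_map("data-00000-of-00002.arrow", "r")
table = pa.ipc.open_stream(source).read_all()

# Zero-copy slice of a single example: only the touched pages fault in
i = 1234
example = table.slice(i, 1).to_pydict()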

🔹 The Correct Mental Model

❌ Wrong (common misunderstanding)

Dataloader loads Parquet directly at training time.

✔ Correct

Dataloader loads Arrow, but Arrow was originally generated FROM Parquet.

Key Insight:

  • Parquet → Arrow conversion happens only once (during prepare time)
  • Arrow → Training happens repeatedly (during load time)

Example:

from datasets import load_dataset
from torch.utils.data import DataLoader

# Prepare time: Converts Parquet → Arrow IPC (happens once, cached)
dataset = load_dataset("squad")  # If source is Parquet, converts to Arrow

# Load time: Memory-maps Arrow files (happens every epoch, fast)
train_data = dataset["train"]
# Memory-mapped Arrow tables, not full RAM load

# Transformations happen in Arrow (zero-copy)
tokenized = train_data.map(tokenize_function, batched=True)  # tokenize_function defined elsewhere
# Arrow buffers → Arrow buffers, no Python loops

# Only final batch converted to PyTorch
tokenized.set_format("torch")  # emit torch tensors instead of Python objects
dataloader = DataLoader(tokenized, batch_size=32)
# Arrow → NumPy → PyTorch tensor (minimal overhead)

🔹 End-to-End HF Dataset Flow

The following diagrams illustrate the complete lifecycle:

A. Dataset Creation Pipeline (Prepare Time):

flowchart TD
    A["User provides Parquet/JSON/CSV"] --> B["PyArrow Reader"]
    B --> C["Convert to Arrow RecordBatch"]
    C --> D["HF Processing: map(), filter(), cast()"]
    D --> E["Write Arrow IPC Shards"]
    E --> F["Generate metadata.json & state.json"]
    F --> G["Dataset Ready"]

B. Training-Time Random Access (Load Time):

flowchart LR
    A["Arrow IPC Files on Disk"] -->|mmap| B["Arrow Table in Virtual Memory"]
    B --> C["HF Dataset __getitem__"]
    C --> D["Arrow Zero-Copy Slice"]
    D --> E["Convert to numpy/torch tensors"]
    E --> F["Train Step"]

C. Parquet vs Arrow Roles:

flowchart TD
    P["Parquet (On Disk)"] -->|Decode Once| A["Arrow RecordBatch"]
    A -->|Write IPC| B["Arrow IPC Shards"]
    B -->|mmap| C["Fast Runtime Access"]

🔹 Case Study: Large Dataset Performance

Example: C4 Dataset

  • Source: web text in gzipped JSON
  • HF preprocess step: JSON → Arrow → shard
  • Runtime: Arrow shards → memory-mapped
  • Training efficiency: 8–20x faster than loading the JSON/Parquet directly

Example: BigCode / Stack-Dedup

  • Stored directly in Arrow shards
  • Indexed efficiently
  • Splits handled by slicing metadata
  • Scales to terabytes on a single machine

This is why HF Datasets scale to hundreds of GB on a laptop—the memory-mapped Arrow files enable efficient random access without loading everything into RAM.

📦 What Dataloaders Use Parquet/Arrow?

The following table shows how different dataloaders handle Parquet and Arrow:

| Loader Type | Uses Parquet at Load Time? | Uses Arrow at Load Time? | Notes |
| --- | --- | --- | --- |
| HF Datasets | ❌ (only during initial prepare) | ✔ (runtime) | Most optimized for index-based access. Converts Parquet → Arrow IPC once, then memory-maps Arrow files. |
| PyTorch DataLoader + manual Parquet | ✔ (slow) | ❌ | Requires manual decoding every batch. Not recommended for large datasets. |
| tf.data.experimental.make_csv_dataset | ❌ | ❌ | Reads CSV/TFRecord from disk. No Arrow integration. |
| Petastorm | ✔ | Partial | Built for distributed training on Parquet; heavy decoding overhead. |
| Ray Dataset | ✔ | ✔ | Uses Arrow internally, reads Parquet into Arrow batches. Good for distributed training. |
| Polars DataLoader | ✔ | ✔ | Same pattern as Ray. Lazy evaluation with Arrow compute. |
| DuckDB | ✔ | ✔ | Queries Parquet directly, returns Arrow results. Excellent for feature stores. |

Key Insight: HF Datasets is the only mainstream loader that:

  • Converts Parquet → Arrow IPC during prepare time
  • Stores Arrow IPC shards on disk
  • Memory-maps them for random access at load time

✔ HuggingFace Datasets:

  • Internal storage: Arrow IPC files (memory-mapped)
  • External storage: Parquet/sharded Parquet (converted once)
  • Supports memory-mapped Arrow tables → zero-copy
  • Unique: Only loader that pre-converts Parquet to Arrow IPC

✔ Ray Data:

  • Reads/writes Parquet
  • Converts partitions to Arrow for transformations
  • Feeds Arrow batches to Ray Train/Torch DDP
  • Difference from HF: Converts Parquet → Arrow on-the-fly, not pre-converted

Example:

import ray.data as rd

# Read Parquet (converts to Arrow on-the-fly)
dataset = rd.read_parquet("s3://bucket/data/")

# Transform in Arrow
transformed = dataset.map_batches(
    preprocess_fn,
    batch_format="arrow"  # Uses Arrow format
)

# Feed to training
for batch in transformed.iter_torch_batches():
    # Arrow → PyTorch with minimal overhead
    train_step(batch)

✔ DuckDB Dataloaders:

  • Parquet → Arrow Scanner → Arrow RecordBatch
  • Used in ML feature stores (Feast, Tecton)
  • Difference from HF: Queries Parquet directly, doesn’t pre-convert

Example:

import duckdb

# Query Parquet, get Arrow result
result = duckdb.sql("""
    SELECT * FROM 'data.parquet' 
    WHERE age > 30
""").arrow()

# Zero-copy Arrow table
# Can convert to Pandas/NumPy/PyTorch

✔ PyTorch DataPipes:

Some Torch DataPipes now support:

  • Parquet datapipe
  • Arrow datapipe
  • Less common but growing
  • Note: Still requires manual setup, not as optimized as HF

✔ Polars LazyFrame + Torch Integration:

  • Polars performs compute in Arrow
  • Scans Parquet lazily
  • Can convert batches → NumPy → Torch tensors
  • Difference from HF: Lazy evaluation, doesn’t pre-convert to Arrow IPC

Example:

import polars as pl
import torch

# Lazy scan Parquet
df = pl.scan_parquet("data/*.parquet")

# Compute in Arrow
result = df.filter(pl.col("age") > 30).collect()

# Convert to PyTorch
tensor = torch.from_numpy(result.to_numpy())

✔ Spark / Glue / Databricks:

  • Uses Parquet for storage
  • Arrow for Pandas UDFs and fast data interchange
  • Note: Primarily batch processing, not optimized for random access like HF

🧭 Simple Mental Model

Here’s a simple mental model for understanding the relationship:

Parquet is like a compressed warehouse.
Arrow is like unpacked items on the workbench.
ML training is easier when items are on the workbench.

The Chef Analogy (for HF Datasets specifically):

Parquet is the “warehouse”: efficient, compressed, long-term storage.
Arrow is the “kitchen line”: fast, open containers ready to serve every item instantly.
HF Datasets is the “chef”: it converts warehouse supplies (Parquet) to prep tables (Arrow), and during training, it just slices and serves without cooking again.

The Classic Flow:

  • You store long-term data in Parquet (cheap, compressed).
  • Prepare time: Convert Parquet → Arrow IPC once (the “prep work”).
  • Load time: Your ML engine memory-maps Arrow files (fast, zero-copy).
  • Your dataloader feeds Arrow batches into the model.
  • Nothing is wasted: no full reads, no full decompress, no repeated conversion.

This is why every modern ML infra stack uses this pattern. The key insight is that the expensive conversion happens once, and the fast access happens repeatedly.

🗺️ How Parquet & Arrow Coexist: The Data Flow

The following diagram illustrates how Parquet and Arrow work together in modern ML dataloaders:

flowchart LR

    subgraph Storage["On-Disk Storage Layer"]
        P1["Parquet Shards<br/>train-000.parquet"]
        P2["Parquet Shards<br/>train-001.parquet"]
        IDX["Arrow Index Files<br/>.arrow / .idx"]
    end

    subgraph Loader["Reader / Scanner Layer"]
        S1["Column Pruning<br/>Read only needed columns"]
        S2["Row-group Filtering<br/>Predicate pushdown"]
        S3["Arrow RecordBatches<br/>(Streaming)"]
    end

    subgraph Memory["In-Memory Arrow Layer"]
        A1["Arrow Table<br/>Zero-copy buffers"]
        A2["Vectorized Ops<br/>map / filter / tokenize"]
    end

    subgraph Dataloader["ML Dataloader Layer"]
        B1["Batch to NumPy"]
        B2["Batch to PyTorch / JAX / TF"]
    end

    P1 --> S1
    P2 --> S1
    IDX --> S1

    S1 --> S2 --> S3
    S3 --> A1 --> A2 --> B1 --> B2

Key Flow Steps:

  1. Storage Layer: Data stored as sharded Parquet files with Arrow index files for fast random access
  2. Scanner Layer: Column pruning and row-group filtering reduce I/O by reading only necessary data
  3. Arrow Layer: Zero-copy buffers enable vectorized operations without Python loops
  4. Dataloader Layer: Final conversion to framework-specific tensors (PyTorch, JAX, TensorFlow)

📊 Practical Benchmark: Parquet vs Arrow Query Performance

The performance difference illustrates the trade-off clearly:

Pattern 1: Direct Parquet Query (with column pruning and row group skipping)

  • Reads only needed columns from disk
  • Skips entire row groups based on zone maps
  • Requires decoding/decompression

Pattern 2: Arrow In-Memory Query (after one-time conversion)

  • Data already in memory (memory-mapped)
  • Zero-copy operations
  • No I/O overhead

Typical Results:

  • Parquet: 0.5-2.0s (depends on selectivity, benefits from column pruning and row group skipping)
  • Arrow: 0.1-0.5s (faster for repeated queries, no I/O overhead)

Takeaway: Parquet wins for one-time queries with high selectivity. Arrow wins for repeated queries or when data fits in memory. This is why the hybrid approach (store in Parquet, convert to Arrow once, then use Arrow) is optimal.
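
A rough way to see the trade-off yourself (paths, columns, and the predicate are placeholders; absolute timings depend entirely on the dataset and hardware):

import time
import pyarrow.parquet as pq
import pyarrow.compute as pc

# Pattern 1: query straight from Parquet (decode + filter on every run)
t0 = time.perf_counter()
cold = pq.read_table("events.parquet",
                     columns=["user_id", "amount"],
                     filters=[("amount", ">", 1000)])
print("parquet scan:", time.perf_counter() - t0)

# Pattern 2: convert once, then filter the in-memory Arrow table repeatedly
table = pq.read_table("events.parquet", columns=["user_id", "amount"])
t0 = time.perf_counter()
hot = table.filter(pc.greater(table["amount"], 1000))
print("arrow filter:", time.perf_counter() - t0)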

⚠️ Common Pitfalls in ML Dataloaders

1. Arrow Table ≠ Arrow Dataset:

# ❌ Wrong: Loading entire dataset into memory
import pyarrow.parquet as pq
table = pq.read_table("large_dataset.parquet")  # OOM risk!

# ✅ Correct: Stream record batches instead of materializing the whole table
import pyarrow.dataset as ds
dataset = ds.dataset("large_dataset/", format="parquet")  # Lazy scan
for batch in dataset.to_batches():
    process(batch)

2. Parquet ≠ Fast for Small Batch Random Access:

# ❌ Wrong: Random access on Parquet
for idx in random_indices:
    row = read_parquet_row("data.parquet", idx)  # Very slow!

# ✅ Correct: Use Arrow index or convert to Arrow first
arrow_table = pq.read_table("data.parquet")
for idx in random_indices:
    row = arrow_table.slice(idx, 1)  # Fast random access

3. Arrow ≠ Good for Storage:

# ❌ Wrong: Storing Arrow files long-term
import pyarrow.feather as feather
feather.write_feather(arrow_table, "data.arrow")  # Poor compression

# ✅ Correct: Store as Parquet, convert to Arrow when needed
arrow_table = pq.read_table("data.parquet")  # Better compression on disk

💡 Rule of Thumb: When to Use What

| Use Case | Format | Reason |
| --- | --- | --- |
| Long-term storage | Parquet | Best compression, efficient I/O |
| Data lake / S3 | Parquet | Industry standard, cost-effective |
| In-memory analytics | Arrow | Zero-copy, SIMD optimization |
| ML training batches | Arrow | Fast conversion to tensors |
| Random access | Arrow (with index) | Fast lookups, memory-mapped |
| One-time queries | Parquet | Column pruning, data skipping |
| Repeated queries | Arrow | No repeated I/O overhead |
| Distributed transfer | Arrow Flight | Network-optimized streaming |

The Golden Rule: Store in Parquet, compute in Arrow, convert to tensors only when needed.


5. Encoding Techniques: The Compression Arsenal 🗜️

Understanding the encoding techniques is crucial for optimizing data pipelines.

Dictionary Encoding (DICT)

How it works:

  • Build a dictionary of unique values
  • Replace values with dictionary indices
  • Store dictionary separately

Example:

Original: ["NYC", "SF", "NYC", "LA", "SF", "NYC"]
Dictionary: {0: "NYC", 1: "SF", 2: "LA"}
Encoded: [0, 1, 0, 2, 1, 0]

Benefits:

  • Excellent compression for low-cardinality columns
  • Enables direct querying on encoded values
  • Fast lookups via dictionary

Trade-offs:

  • Dictionary must fit in memory
  • Less effective for high-cardinality columns

Run-Length Encoding (RLE)

How it works:

  • Compress sequences of identical values
  • Store as (value, count) pairs

Example:

Original: [1, 1, 1, 1, 2, 2, 3, 3, 3]
Encoded: [(1, 4), (2, 2), (3, 3)]

Benefits:

  • Excellent for sorted or nearly-sorted data
  • Very fast encoding/decoding
  • Minimal CPU overhead

Use Cases:

  • Time-series data
  • Sorted columns
  • Boolean columns

Bit-Packed Encoding (BP)

How it works:

  • Store small integers using minimal bits
  • Pack multiple values into single bytes

Example:

Values: [3, 5, 2, 7] (all < 8, need 3 bits each)
Packed: 011 101 010 111 (12 bits = 1.5 bytes vs 16 bytes original)

Benefits:

  • Maximum compression for small integers
  • Direct bit manipulation
  • SIMD-friendly operations

Use Cases:

  • Integer IDs
  • Enumerated types
  • Small numeric ranges

Delta Encoding

How it works:

  • Store differences between consecutive values
  • Often combined with RLE for sorted data

Example:

Original: [100, 102, 105, 109, 114]
Deltas: [100, +2, +3, +4, +5]

Benefits:

  • Excellent for sorted sequences
  • Smaller values compress better
  • Works well with RLE

6. Zone Maps and Data Skipping: The Query Optimizer’s Best Friend 🗺️

Zone Maps are metadata structures that store min/max values for data chunks, enabling powerful query optimizations.

How Zone Maps Work

For each column chunk, Parquet stores:

  • Minimum value
  • Maximum value
  • Null count
  • Distinct count (optional)

Example:

Query: WHERE age > 50

Zone Map for age column:
Chunk 1: min=18, max=35  → Skip (all values < 50)
Chunk 2: min=25, max=45  → Skip (all values < 50)
Chunk 3: min=40, max=60  → Read (may contain values > 50)
Chunk 4: min=55, max=75  → Read (all values > 50)

Result: Only 2 out of 4 chunks need to be read, reducing I/O by 50%.
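
These statistics are plain metadata that you can inspect yourself with pyarrow; a short sketch (file and column names are illustrative):

import pyarrow.parquet as pq

pf = pq.ParquetFile("data.parquet")
meta = pf.metadata
age_idx = pf.schema_arrow.get_field_index("age")

for rg in range(meta.num_row_groups):
    stats = meta.row_group(rg).column(age_idx).statistics
    if stats is not None and stats.has_min_max:
        # For the query "age > 50", a reader can skip any row group
        # whose max is <= 50 without touching its data pages
        print(f"row group {rg}: min={stats.min}, max={stats.max}")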

Fine-Grained vs. Chunk-Level Skipping

Parquet (Fine-Grained):

  • Row group-level statistics (zone maps)
  • Can skip individual row groups based on min/max values
  • Most granular skipping capability

Arrow (Chunk-Level):

  • Only chunk-level statistics
  • Must read entire chunk if any value matches
  • Less efficient for selective queries

ORC (Balanced):

  • Stripe-level statistics
  • Smaller stripes than Arrow chunks
  • Better than Arrow, worse than Parquet

Real-World Impact

For a query selecting 1% of rows:

  • Parquet: Reads ~1-5% of data (excellent skipping)
  • Arrow: Reads ~10-20% of data (chunk-level only)

7. The Future: Unified Co-Design 🚀

The next generation of data systems will bridge the gap between Parquet and Arrow.

Augmented Parquet: In-Memory Representation

Key Innovations:

  1. Direct Query on Encoded Data:

    • Keep data in dictionary-encoded format in memory
    • Run predicates directly on dictionary indices
    • Avoid decompression until final output
  2. SIMD-Optimized Operations:

    • Vectorized operations on encoded data
    • Leverage CPU SIMD instructions
    • Process multiple values in parallel
  3. Hybrid Format:

    • Parquet on disk (compressed)
    • Augmented Parquet in memory (encoded but queryable)
    • Arrow for inter-system transfer (when needed)

Performance Gains:

  • Up to 100x speedup over streaming baseline [1]
  • Maintains Parquet’s compression benefits
  • Gains Arrow’s zero-copy advantages

Enhanced Arrow: Encoding Support

Key Innovations:

  1. Dictionary Encoding in Arrow:

    • Add dictionary type support
    • Enable direct querying on dictionaries
    • Maintain zero-copy semantics
  2. Predicate Pushdown:

    • Run filters on encoded data
    • Skip decoding for filtered-out rows
    • 2x to 4x speedup demonstrated [1]
  3. Compression Options:

    • Optional compression for Arrow files
    • Trade speed for storage when needed
    • Flexible compression levels

8. Practical Recommendations and Tools Integration 💡

When to Use Parquet

Best for:

  • Long-term storage (data lakes, archives)
  • High compression requirements
  • Selective queries (low selectivity)
  • Cost-sensitive storage (cloud object storage)
  • Batch processing pipelines

Example Use Cases:

  • Data warehouse storage
  • ML training data archives
  • Log file storage
  • Historical data retention

When to Use Arrow

Best for:

  • In-memory analytics
  • Real-time processing
  • Inter-system data transfer
  • GPU acceleration
  • Low-latency queries

Example Use Cases:

  • Feature engineering pipelines
  • Real-time dashboards
  • Distributed ML inference
  • Stream processing
  • Interactive analytics

Hybrid Approach: The Best of Both Worlds

Recommended Pattern (The Golden Rule):

Storage Layer:     Parquet (compressed, efficient)

Ingestion Layer:   Parquet → Arrow (one-time conversion)

Processing Layer:  Arrow (zero-copy, fast)

Output Layer:      Arrow → Parquet (for persistence)
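
A compact sketch of that pattern with pyarrow (file names and the derived column are illustrative):

import pyarrow.parquet as pq
import pyarrow.compute as pc

# Storage layer: read compressed Parquet, only the columns this job needs
table = pq.read_table("events.parquet", columns=["user_id", "amount_cents"])

# Processing layer: vectorized Arrow compute, no Python row loop
table = table.append_column("amount_usd", pc.divide(table["amount_cents"], 100))

# Output layer: persist the result back to Parquet for the next consumer
pq.write_table(table, "features.parquet", compression="zstd")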

Tools That Support This Pattern

Apache Spark:

  • Native Parquet/Arrow support
  • Reads Parquet with column pruning and row group skipping
  • Converts to Arrow for zero-copy operations with Pandas

Dask:

  • Seamless Parquet/Arrow integration
  • Lazy loading from Parquet
  • Automatic Arrow conversion for in-memory operations

Polars:

  • Built on Arrow, writes Parquet
  • All operations use Arrow format internally
  • Zero-copy operations throughout

DuckDB:

  • Optimized for both formats
  • Queries Parquet directly with row group skipping
  • Returns Arrow results for zero-copy access

HuggingFace Datasets:

  • Converts Parquet → Arrow IPC once (prepare time)
  • Memory-maps Arrow files for fast random access (load time)
  • Enables handling massive datasets on consumer hardware

9. Performance Benchmarks and Real-World Impact 📊

Compression Ratios

Typical Compression Ratios (lower is better):

| Format | Compression Ratio | Use Case |
| --- | --- | --- |
| Parquet (Snappy) | 0.13-0.20 (13-20% of original) | Balanced compression/speed |
| Parquet (Zstd) | 0.10-0.15 (10-15% of original) | Maximum compression |
| Arrow (uncompressed) | 1.00-1.10 (100-110% of original) | Maximum speed |
| Arrow (compressed) | 0.20-0.30 (20-30% of original) | Balanced option |

Query Performance

Selective Query (1% selectivity):

  • Parquet: 10-50x faster than row-oriented (benefits from row group skipping)
  • Arrow: 5-20x faster (if data fits in memory)

Full Table Scan:

  • Parquet: 2-5x faster than row-oriented
  • Arrow: 10-100x faster (zero-copy, SIMD)

Storage Cost Impact

Example: 1TB dataset

| Format | Storage Size | Monthly Cost (S3) | Savings |
| --- | --- | --- | --- |
| CSV | 1 TB | $23 | Baseline |
| Parquet | 130 GB (0.13 CR) | $3 | 87% savings |
| Arrow | 1.07 TB (1.07 CR) | $24.61 | No savings |

Annual Savings with Parquet: $240 per TB


10. Common Pitfalls and Best Practices ⚠️

Common Mistakes

1. Over-Partitioning Parquet Files:

# Bad: Too many small files
df.write.partitionBy("year", "month", "day", "hour", "minute").parquet("data/")
# Result: Millions of tiny files, poor performance

# Good: Balanced partitioning
df.write.partitionBy("year", "month").parquet("data/")
# Result: Reasonable file sizes, good performance

2. Ignoring Column Order:

# Bad: Frequently queried columns last
schema = ["id", "metadata1", "metadata2", "name", "age"]  # name, age queried often

# Good: Frequently queried columns first
schema = ["name", "age", "id", "metadata1", "metadata2"]
# Enables better column pruning

3. Wrong Compression Codec:

# Bad: Gzip (slow) for frequently accessed data
df.write.parquet("data/", compression="gzip")

# Good: Snappy (fast) for hot data, Zstd for cold data
df.write.parquet("data/", compression="snappy")  # Hot data
df.write.parquet("archive/", compression="zstd")  # Cold data

Best Practices

1. Optimal Row Group Size:

  • Parquet: 128MB - 1GB row groups
  • Balance between skipping granularity and read efficiency
  • Larger row groups = better compression, less skipping
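
With pyarrow, the row group size is a write-time knob; a minimal sketch (table is an existing pyarrow.Table, and the value shown is just a reasonable starting point):

import pyarrow.parquet as pq

pq.write_table(
    table,                      # existing pyarrow.Table
    "data.parquet",
    row_group_size=1_000_000,   # rows per row group
    compression="snappy",
)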

2. Column Ordering:

  • Put frequently queried columns first
  • Put high-cardinality columns early
  • Put filtering columns early

3. Compression Selection:

  • Snappy: Fast, moderate compression (hot data)
  • Zstd: Slower, excellent compression (cold data)
  • LZ4: Very fast, good compression (real-time)

4. Arrow Chunk Sizes:

  • Small chunks: Better for streaming, more overhead
  • Large chunks: Better for batch processing, less overhead
  • Typical: 64KB - 1MB per chunk

Conclusion: The Path Forward

The core lesson for data engineers and ML practitioners is: Parquet wins the battle of storage efficiency and I/O, while Arrow is the high-speed rail system for data in motion, making both indispensable in modern large-scale pipelines.

Returning to our analogy: Parquet is the archivist—meticulously compressing and indexing data on permanent storage, enabling efficient retrieval through column pruning and row group skipping. Arrow is the zero-overhead desk organizer—ready for immediate use, optimized for the CPU, enabling zero-copy operations and blazing-fast computation.

The Golden Rule (Revisited):

Store in Parquet, compute in Arrow, convert to tensors only when needed.

This principle encapsulates everything we’ve discussed:

  • Parquet for the warehouse (storage efficiency, compression, I/O optimization)
  • Arrow for the workbench (in-memory compute, zero-copy, SIMD-ready)
  • Tensors only at the final step (when feeding the model)

Key Takeaways:

  1. Use Parquet for Storage: Best compression (0.13 CR = 13% of original), excellent I/O efficiency, fine-grained row group skipping
  2. Use Arrow for Processing: Zero-copy operations, SIMD optimization, language interoperability
  3. Hybrid Approach: Store in Parquet, process in Arrow, write back to Parquet
  4. Future is Unified: Next-gen systems will combine Parquet’s encoding with Arrow’s speed
  5. Choose Wisely: Format selection depends on your query patterns and access patterns

The Future:

The next generation of data systems will bridge the gap, creating formats that offer:

  • Parquet-level compression
  • Arrow-level speed
  • Unified in-memory and on-disk representation
  • Direct querying on encoded data
  • SIMD-optimized operations

As we build the data systems of tomorrow, understanding these fundamental trade-offs and design principles will be crucial for creating efficient, scalable, and cost-effective pipelines.



Understanding data formats is fundamental to building efficient ML systems. In production, the choice between Parquet and Arrow can mean the difference between a pipeline that costs thousands per month versus one that costs hundreds.