PyArrow
PyArrow is the Python implementation of Apache Arrow, providing a high-performance interface to the Arrow columnar memory format and its compute libraries. It enables efficient data interchange and in-memory analytics, and integrates with the Python data science ecosystem (pandas, NumPy) as well as big data processing systems.
Package Information
- Package Name: pyarrow
- Language: Python
- Installation: pip install pyarrow
- Documentation: https://arrow.apache.org/docs/python
Core Imports
import pyarrow as pa
Common specialized imports:
import pyarrow.compute as pc
import pyarrow.parquet as pq
import pyarrow.csv as csv
import pyarrow.dataset as ds
import pyarrow.flight as flight
Basic Usage
import pyarrow as pa
import numpy as np
# Create arrays from Python data
arr = pa.array([1, 2, 3, 4, 5])
str_arr = pa.array(['hello', 'world', None, 'arrow'])
# Create tables
table = pa.table({
'integers': [1, 2, 3, 4],
'strings': ['foo', 'bar', 'baz', None],
'floats': [1.0, 2.5, 3.7, 4.1]
})
# Read/write Parquet files
import pyarrow.parquet as pq
pq.write_table(table, 'example.parquet')
loaded_table = pq.read_table('example.parquet')
# Compute operations
import pyarrow.compute as pc
result = pc.sum(arr)
filtered = pc.filter(table, pc.greater(table['integers'], 2))
Architecture
PyArrow's design centers around the Arrow columnar memory format:
- Columnar Storage: Data organized by columns for efficient analytical operations
- Zero-Copy Operations: Memory-efficient data sharing between processes and languages
- Type System: Rich data types including nested structures, decimals, and temporal types
- Compute Engine: Vectorized operations for high-performance analytics
- Format Support: Native support for Parquet, CSV, JSON, ORC, and custom formats
- Interoperability: Seamless integration with pandas, NumPy, and other Python libraries
This architecture lets PyArrow serve as a foundational component for scalable data processing applications: data moves quickly between systems while columnar layouts keep memory usage efficient.
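As a minimal sketch of the zero-copy and interoperability points above, the example below moves data between NumPy, Arrow, and pandas (pandas is assumed to be installed); conversions of primitive numeric columns can avoid copies, though nulls or unsupported types force one.
import pyarrow as pa
import numpy as np
# NumPy array -> Arrow array (primitive dtypes can be wrapped without copying)
np_values = np.array([1.0, 2.5, 3.7], dtype=np.float64)
arr = pa.array(np_values)
# Arrow array -> NumPy; zero_copy_only raises if a copy would be required
back = arr.to_numpy(zero_copy_only=True)
# Table -> pandas DataFrame and back
table = pa.table({'x': arr})
df = table.to_pandas()
round_tripped = pa.Table.from_pandas(df)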
Capabilities
Core Data Structures
Fundamental data containers including arrays, tables, schemas, and type definitions. These form the foundation for all PyArrow operations and provide the columnar data structures that enable efficient analytics.
def array(obj, type=None, mask=None, size=None, from_pandas=None, safe=True): ...
def table(data, schema=None, metadata=None, columns=None): ...
def schema(fields, metadata=None): ...
def field(name, type, nullable=True, metadata=None): ...
class Array: ...
class Table: ...
class Schema: ...
class Field: ...
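A brief sketch combining these constructors: define fields and a schema, then build a table that conforms to it.
import pyarrow as pa
# Build an explicit schema from fields
schema = pa.schema([
    pa.field('id', pa.int64(), nullable=False),
    pa.field('name', pa.string()),
])
# Create a table that conforms to the schema
table = pa.table({'id': [1, 2, 3], 'name': ['a', 'b', None]}, schema=schema)
print(table.schema)
print(table.num_rows, table.num_columns)   # 3 2
print(table.column('name'))                # ChunkedArray of strings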
Data Types System
Comprehensive type system supporting primitive types, nested structures, temporal types, and custom extension types. Provides type checking, conversion, and inference capabilities essential for data processing workflows.
def int64(): ...
def string(): ...
def timestamp(unit, tz=None): ...
def list_(value_type): ...
def struct(fields): ...
class DataType: ...
def is_integer(type): ...
def cast(arr, target_type, safe=True): ...
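The following sketch shows type construction, inspection, and casting with these helpers; note that the type predicates live in pyarrow.types and cast in pyarrow.compute.
import pyarrow as pa
import pyarrow.compute as pc
import pyarrow.types as pa_types
# Construct parameterized and nested types
ts_type = pa.timestamp('ms', tz='UTC')
list_type = pa.list_(pa.int64())
struct_type = pa.struct([('x', pa.int64()), ('y', pa.string())])
# Type predicates
print(pa_types.is_integer(pa.int64()))   # True
print(pa_types.is_temporal(ts_type))     # True
# Cast an array to another type; safe casting raises on lossy conversions
arr = pa.array([1, 2, 3])
as_float = pc.cast(arr, pa.float64())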
Compute Functions
High-performance vectorized compute operations including mathematical functions, string operations, temporal calculations, aggregations, and filtering. The compute engine provides 200+ functions optimized for columnar data.
def add(x, y): ...
def subtract(x, y): ...
def multiply(x, y): ...
def sum(array): ...
def filter(data, mask): ...
def take(data, indices): ...
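A short sketch of vectorized compute over a small table: element-wise arithmetic, aggregation, and filtering.
import pyarrow as pa
import pyarrow.compute as pc
table = pa.table({'a': [1, 2, 3, 4], 'b': [10.0, 20.0, 30.0, 40.0]})
# Element-wise arithmetic returns new arrays
total = pc.add(table['a'], table['b'])
# Aggregations return scalars
print(pc.sum(table['a']))        # 10
print(pc.mean(table['b']))       # 25.0
# Boolean mask + filter keeps matching rows
mask = pc.greater(table['a'], 2)
filtered = table.filter(mask)
print(filtered.num_rows)         # 2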
File Format Support
Native support for reading and writing multiple file formats including Parquet, CSV, JSON, Feather, and ORC. Provides high-performance I/O with configurable options for compression, encoding, and metadata handling.
# Parquet
def read_table(source, **kwargs): ...
def write_table(table, where, **kwargs): ...
# CSV
def read_csv(input_file, **kwargs): ...
def write_csv(data, output_file, **kwargs): ...
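A sketch of round-tripping data through CSV and Parquet with the readers and writers above; the file names are illustrative.
import pyarrow as pa
import pyarrow.csv as csv
import pyarrow.parquet as pq
table = pa.table({'city': ['Oslo', 'Lima'], 'temp_c': [3.5, 21.0]})
# CSV: write, then read back (column types are inferred on read)
csv.write_csv(table, 'weather.csv')
from_csv = csv.read_csv('weather.csv')
# Parquet: compression and column selection are configurable
pq.write_table(from_csv, 'weather.parquet', compression='zstd')
from_parquet = pq.read_table('weather.parquet', columns=['city'])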
Memory and I/O Management
Memory pool management, buffer operations, compression codecs, and file system abstraction. Provides control over memory allocation and efficient I/O operations across different storage systems.
def default_memory_pool(): ...
def compress(data, codec=None): ...
def input_stream(source): ...
class Buffer: ...
class MemoryPool: ...
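A minimal sketch of buffer creation, compression, and memory-pool inspection with these primitives.
import pyarrow as pa
# Wrap raw bytes in an Arrow Buffer (no copy for bytes-like input)
buf = pa.py_buffer(b'some payload to compress')
# Compress / decompress with a named codec; the decompressed size must be supplied
compressed = pa.compress(buf, codec='gzip')
restored = pa.decompress(compressed, decompressed_size=buf.size, codec='gzip')
# Inspect the default memory pool
pool = pa.default_memory_pool()
print(pool.bytes_allocated())
# Stream-style I/O over an in-memory source
with pa.input_stream(buf) as stream:
    data = stream.read()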
Dataset Operations
Multi-file dataset interface supporting partitioned data, lazy evaluation, and distributed processing. Enables efficient querying of large datasets stored across multiple files with automatic partition discovery.
def dataset(source, **kwargs): ...
def write_dataset(data, base_dir, **kwargs): ...
class Dataset: ...
class Scanner: ...
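A sketch of writing a partitioned dataset and scanning it lazily with a filter; the directory name and columns are illustrative.
import pyarrow as pa
import pyarrow.compute as pc
import pyarrow.dataset as ds
table = pa.table({
    'year': [2022, 2022, 2023, 2023],
    'value': [1.0, 2.0, 3.0, 4.0],
})
# Write one directory per 'year' partition under the base directory
ds.write_dataset(table, 'events', format='parquet', partitioning=['year'])
# Discover the partitioned files and push the filter down into the scan
dataset = ds.dataset('events', format='parquet', partitioning=['year'])
result = dataset.to_table(filter=pc.field('year') == 2023, columns=['value'])
print(result.num_rows)   # 2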
Arrow Flight RPC
High-performance RPC framework for distributed data services. Provides client-server architecture for streaming large datasets with authentication, metadata handling, and custom middleware support.
def connect(location, **kwargs): ...
class FlightClient: ...
class FlightServerBase: ...
class FlightDescriptor: ...
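A minimal client-side sketch, assuming a Flight server is already listening at grpc://localhost:8815 and exposes a dataset under the path 'example' (both the address and the path are hypothetical).
import pyarrow.flight as flight
# Connect to a running Flight server
client = flight.connect('grpc://localhost:8815')
# Ask the server how to retrieve the dataset identified by this descriptor
descriptor = flight.FlightDescriptor.for_path('example')
info = client.get_flight_info(descriptor)
# Fetch the stream behind the first endpoint and materialize it as a Table
reader = client.do_get(info.endpoints[0].ticket)
table = reader.read_all()
print(table.num_rows)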
Advanced Features
Specialized functionality including CUDA GPU support, Substrait query integration, execution engine operations, and data interchange protocols for advanced use cases and system integration.
# CUDA support
class Context: ...
class CudaBuffer: ...
# Substrait integration
def run_query(plan): ...
def serialize_expressions(expressions): ...
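As a rough sketch of the CUDA path (this assumes a CUDA-enabled PyArrow build and an available GPU; the buffer contents are illustrative):
import numpy as np
import pyarrow.cuda as cuda   # only importable in CUDA-enabled builds
# Select GPU device 0 and copy host data onto it
ctx = cuda.Context(0)
host_data = np.arange(1024, dtype=np.int64)
device_buf = ctx.buffer_from_data(host_data)
# Copy back to host memory for verification
host_copy = device_buf.copy_to_host()
print(device_buf.size, len(host_copy))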
Version and Build Information
def show_versions(): ...
def show_info(): ...
def cpp_build_info(): ...
def runtime_info(): ...
Access to version information, build configuration, and runtime environment details for troubleshooting and compatibility checking.
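For example, a quick environment check might look like this:
import pyarrow as pa
print(pa.__version__)   # library version string
pa.show_versions()      # versions of PyArrow and key optional dependencies
pa.show_info()          # build configuration plus runtime environment details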
Exception Handling
class ArrowException(Exception): ...
class ArrowInvalid(ArrowException): ...
class ArrowTypeError(ArrowException): ...
class ArrowIOError(ArrowException): ...
Comprehensive exception hierarchy for error handling in data processing workflows.
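A brief sketch of catching a typical failure: casting non-numeric strings to integers raises ArrowInvalid, which is exposed at the top level of the package.
import pyarrow as pa
import pyarrow.compute as pc
values = pa.array(['1', '2', 'not-a-number'])
try:
    pc.cast(values, pa.int64())
except pa.ArrowInvalid as exc:
    print(f'cast failed: {exc}')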