Sunday, September 8, 2024

[Python Basics] The strongest Pandas replacement: Polars

Polars is a high-performance DataFrame library for manipulating structured data, and arguably the most promising package to replace pandas. Its core is written in Rust, and the library also provides a Python interface. Its main features include:

  • Fast: Polars is written from scratch in Rust, designed close to the machine, and has no external dependencies.
  • I/O: first-class support for all common data storage layers: local files, cloud storage, and databases.
  • Easy to use: write your queries the way they were intended; internally, Polars' query optimizer determines the most efficient way to execute them.
  • Out-of-core: Polars supports out-of-core data transformation through its streaming API, which lets you process results without holding all of the data in memory at once.
  • Parallel: Polars makes full use of the machine by dividing the workload among the available CPU cores, without any additional configuration.
  • Vectorized query engine: Polars uses Apache Arrow, a columnar data format, to process queries in a vectorized manner, and uses SIMD to optimize CPU usage. (A minimal taste of the API follows below.)
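
As a quick taste, here is a minimal lazy query (a sketch with made-up data): Polars optimizes the plan and parallelizes the execution automatically.

import polars as pl

df = pl.DataFrame({"group": ["a", "a", "b"], "value": [1, 2, 3]})
out = (
    df.lazy()                     # build a query plan instead of executing eagerly
    .filter(pl.col("value") > 1)
    .group_by("group")
    .agg(pl.col("value").sum())
    .collect()                    # optimize and execute, using all cores
)
print(out)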

User guide: https://pola-rs.github.io/polars/user-guide/
API reference: https://pola-rs.github.io/polars/py-polars/html/reference/io.html

Introduction

Polars aims to provide a lightning-fast DataFrame library that:

  • Utilizes all available cores on the machine.
  • Reduces unnecessary work/memory allocations by optimizing queries.
  • Handles data sets much larger than the available RAM.
  • Has a consistent and predictable API.
  • Has a strict schema (data types should be known before the query runs).

Polars is written in Rust, which gives it C/C++-level performance and full control over the performance-critical parts of the query engine. Accordingly, a lot of effort goes into:

  • Reducing redundant copies.
  • Traversing memory caches efficiently.
  • Minimizing contention in parallelism.
  • Processing data in chunks.
  • Reusing memory allocations.

# 1. Basics

Series & DataFrames

A Series is a one-dimensional data structure in which all elements have the same data type (e.g., integer, string). The following snippet shows how to create a simple named Series object.

import polars as pl
import numpy as np

# Create a named Series
s = pl.Series("a", [1, 2, 3, 4, 5])
print(s)

# Aggregations work directly on a Series
print(s.min())
print(s.max())

# String operations live under the str namespace
s = pl.Series("a", ["polar", "bear", "arctic", "polar fox", "polar bear"])
s2 = s.str.replace("polar", "pola")
print(s2)

# Date ranges are Series too; temporal operations live under dt
from datetime import date

start = date(2001, 1, 1)
stop = date(2001, 1, 9)
s = pl.date_range(start, stop, interval="2d", eager=True)
print(s.dt.day())

A DataFrame is a two-dimensional data structure backed by one or more Series, and can be viewed as an abstraction over a collection (e.g., a list) of Series. The operations you can perform on a DataFrame are very similar to those in a SQL query: GROUP BY, JOIN, PIVOT, and user-defined functions.

from datetime import datetime

df = pl.DataFrame(
    {
        "integer": [1, 2, 3, 4, 5],
        "date": [
            datetime(2022, 1, 1),
            datetime(2022, 1, 2),
            datetime(2022, 1, 3),
            datetime(2022, 1, 4),
            datetime(2022, 1, 5),
        ],
        "float": [4.0, 5.0, 6.0, 7.0, 8.0],
    }
)

print(df)
print(df.head(3))     # first 3 rows
print(df.describe())  # summary statistics

Reading & writing

import polars as pl
from datetime import datetime

df = pl.DataFrame(
    {
        "integer": [1, 2, 3],
        "date": [
            datetime(2022, 1, 1),
            datetime(2022, 1, 2),
            datetime(2022, 1, 3),
        ],
        "float": [4.0, 5.0, 6.0],
    }
)

print(df)
df.write_csv("output.csv")
df_csv = pl.read_csv("output.csv")
print(df_csv)  # without parsing, the date column comes back as strings
df_csv = pl.read_csv("output.csv", try_parse_dates=True)
print(df_csv)  # date-like columns are parsed while reading
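
CSV is not the only supported format; for example, a Parquet round trip preserves data types, so no date parsing is needed on read (a brief extra sketch using the standard write_parquet/read_parquet methods):

df.write_parquet("output.parquet")
df_parquet = pl.read_parquet("output.parquet")
print(df_parquet)  # dtypes, including datetimes, survive the round trip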

Expressions

import polars as pl

# Create a simple DataFrame
data = {'column1': [1, 2, 3],
        'column2': ['a', 'b', 'c']}
df = pl.DataFrame(data)

# Select a column using an expression
selected_df = df.select(pl.col("column1"))

# Filter rows using an expression
filtered_df = df.filter(pl.col("column1") > 1)
print(selected_df)
print(filtered_df)
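
Expressions compose: the same building blocks can be chained and mixed in select, filter, and with_columns. A small sketch (the derived column names here are our own):

out = df.with_columns(
    (pl.col("column1") * 2).alias("doubled"),             # arithmetic expression
    pl.col("column2").str.to_uppercase().alias("upper"),  # string expression
)
print(out)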

# 2. Combining DataFrames

df = pl.DataFrame(
    {
        "a": np.arange(0, 8),
        "b": np.random.rand(8),
        "d": [1, 2.0, np.NaN, np.NaN, 0, -5, -42, None],
    }
)

df2 = pl.DataFrame(
    {
        "x": np.arange(0, 8),
        "y": ["A", "A", "A", "B", "B", "C", "X", "X"],
    }
)
# Join the two frames on a == x
joined = df.join(df2, left_on="a", right_on="x")
print(joined)

# Stack the columns of df2 horizontally next to df
stacked = df.hstack(df2)
print(stacked)

# 3. Concepts

Data types

Polars is based entirely on Arrow data types and is backed by Arrow memory arrays. This makes data processing cache-efficient and well suited to inter-process communication. Most data types follow Arrow's implementation exactly, with some exceptions such as Utf8 (actually LargeUtf8), Categorical, and Object (which has limited support). The data types are listed below, followed by a small example:

| Group | Type | Details |
| --- | --- | --- |
| Numeric | Int8 | 8-bit signed integer. |
| | Int16 | 16-bit signed integer. |
| | Int32 | 32-bit signed integer. |
| | Int64 | 64-bit signed integer. |
| | UInt8 | 8-bit unsigned integer. |
| | UInt16 | 16-bit unsigned integer. |
| | UInt32 | 32-bit unsigned integer. |
| | UInt64 | 64-bit unsigned integer. |
| | Float32 | 32-bit floating-point number. |
| | Float64 | 64-bit floating-point number. |
| Nested | Struct | A struct array, represented internally as Vec<Series>, for packing multiple/heterogeneous values into a single column. |
| | List | A list array, containing a child array with the list values and an offset array (internally Arrow's LargeList). |
| Temporal | Date | A date, internally the number of days since the UNIX epoch, encoded as a 32-bit signed integer. |
| | Datetime | A datetime, internally the number of microseconds since the UNIX epoch, encoded as a 64-bit signed integer. |
| | Duration | A time difference, internally in microseconds; created when subtracting Date/Datetime values. |
| | Time | A time of day, internally the number of nanoseconds since midnight. |
| Other | Boolean | A boolean, bit-packed internally. |
| | Utf8 | String data (internally Arrow's LargeUtf8). |
| | Binary | Raw bytes. |
| | Object | A data type with limited support; can hold any value. |
| | Categorical | A categorical encoding of a set of strings. |
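
To see these types in practice: every Series has a dtype and every DataFrame has a schema (a short sketch with made-up data):

df = pl.DataFrame({"a": [1, 2], "b": ["x", "y"], "c": [True, False]})
print(df.schema)      # e.g. {'a': Int64, 'b': Utf8, 'c': Boolean}
print(df["a"].dtype)  # Int64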

Contexts

Polars has developed its own Domain Specific Language (DSL) for data transformation. The language is very easy to use and allows complex queries to remain human-readable. Its two core components are contexts and expressions, covered here and in the following sections.

As the name suggests, a context refers to the setting in which an expression is evaluated. There are three main contexts[1]:

  1. Selection: df.select([..]), df.with_columns([..])
  2. Filtering: df.filter()
  3. Group by / aggregation: df.group_by(..).agg([..])

df = pl.DataFrame(
    {
        "nrs": [1, 2, 3, None, 5],
        "names": ["foo", "ham", "spam", "egg", None],
        "random": np.random.rand(5),
        "groups": ["A", "A", "B", "C", "B"],
    }
)
print(df)
# Compute new columns from df, returning a new DataFrame
out = df.select(
    pl.sum("nrs"),  # sum of nrs
    pl.col("names").sort(),  # names, sorted
    pl.col("names").first().alias("first name"),  # first value of names
    (pl.mean("nrs") * 10).alias("10xnrs"),  # mean of nrs * 10
)
print(out)
# Add new columns to the original df
df = df.with_columns(
    pl.sum("nrs").alias("nrs_sum"),
    pl.col("random").count().alias("count"),
)
print(df)
out = df.filter(pl.col("nrs") > 2)
print(out)
out = df.group_by("groups").agg(
    pl.sum("nrs"),  # sum nrs by groups
    pl.col("random").count().alias("count"),  # count group members
    # sum random where name != null
    pl.col("random").filter(pl.col("names").is_not_null()).sum().name.suffix("_sum"),
    pl.col("names").reverse().alias("reversed names"),
)
print(out) 

Lazy / eager API

Polars supports two modes of operation: eager and lazy. With the eager API the query is executed immediately, while with the lazy API the query is only evaluated once it is "needed".

!wget https://mirror.coggle.club/dataset/heart.csv
!head heart.csv
df = pl.read_csv("heart.csv")
df_small = df.filter(pl.col("age") > 5)
df_agg = df_small.group_by("sex").agg(pl.col("chol").mean())
print(df_agg)
q = (
    pl.scan_csv("heart.csv")
    .filter(pl.col("age") > 5)
    .group_by("sex")
    .agg(pl.col("chol").mean())
)
# At this point only the query plan has been built

df = q.collect()  # the actual computation happens here
print(df)
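
Before collecting, you can inspect the plan the optimizer produced; LazyFrame.explain() is available in recent Polars versions:

print(q.explain())  # prints the optimized query plan, e.g. with the filter pushed into the scan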

Streaming API

https://pola-rs.github.io/polars/user-guide/concepts/streaming/

Another added benefit of the Lazy API is that it allows queries to be executed in a streaming manner. Instead of processing all data at once, Polars can execute queries in batches, allowing you to process data sets larger than memory.

q = (
    pl.scan_csv("heart.csv")
    .filter(pl.col("age") > 5)
    .group_by("sex")
    .agg(pl.col("chol").mean())
)

df = q.collect(streaming=True)
print(df) 
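
To check which parts of a query can actually run in streaming mode, recent Polars versions accept a streaming flag on explain (the exact output format varies by version):

print(q.explain(streaming=True))  # streaming-capable sections are marked in the plan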

# 4. Expressions

Basic operators

df = pl.DataFrame(
    {
        "nrs": [1, 2, 3, None, 5],
        "names": ["foo", "ham", "spam", "egg", None],
        "random": np.random.rand(5),
        "groups": ["A", "A", "B", "C", "B"],
    }
)
print(df)
df_numerical = df.select(
    (pl.col("nrs") + 5).alias("nrs + 5"),
    (pl.col("nrs") - 5).alias("nrs - 5"),
    (pl.col("nrs") * pl.col("random")).alias("nrs * random"),
    (pl.col("nrs") / pl.col("random")).alias("nrs / random"),
)
print(df_numerical)
df_logical = df.select(
    (pl.col("nrs") > 1).alias("nrs > 1"),
    (pl.col("random") <= 0.5).alias("random <= .5"),
    (pl.col("nrs") != 1).alias("nrs != 1"),
    (pl.col("nrs") == 1).alias("nrs == 1"),
    ((pl.col("random") <= 0.5) & (pl.col("nrs") > 1)).alias("and_expr"),  # and
    ((pl.col("random") <= 0.5) | (pl.col("nrs") > 1)).alias("or_expr"),  # or
)
print(df_logical) 

Column selections

from datetime import date, datetime

df = pl.DataFrame(
    {
        "id": [9, 4, 2],
        "place": ["Mars", "Earth", "Saturn"],
        "date": pl.date_range(date(2022, 1, 1), date(2022, 1, 3), "1d", eager=True),
        "sales": [33.4, 2142134.1, 44.7],
        "has_people": [False, True, False],
        "logged_at": pl.datetime_range(
            datetime(2022, 12, 1), datetime(2022, 12, 1, 0, 0, 2), "1s", eager=True
        ),
    }
).with_row_count("rn")
print(df)
out = df.select(pl.col("*"))

# Is equivalent to
out = df.select(pl.all())
print(out)
out = df.select(pl.col("*").exclude("logged_at", "rn"))
print(out)
out = df.select(pl.col("date", "logged_at").dt.to_string("%Y-%h-%d"))
print(out)
out = df.select(pl.col("^.*(as|sa).*$"))
print(out)
out = df.select(pl.col(pl.Int64, pl.UInt32, pl.Boolean).n_unique())
print(out)
import polars.selectors as cs
out = df.select(cs.numeric() - cs.first())
print(out)
out = df.select(cs.contains("rn"), cs.matches(".*_.*"))
print(out) 

Functions

df = pl.DataFrame(
    {
        "nrs": [1, 2, 3, None, 5],
        "names": ["foo", "ham", "spam", "egg", "spam"],
        "random": np.random.rand(5),
        "groups": ["A", "A", "B", "C", "B"],
    }
)
print(df)
df_samename = df.select(pl.col("nrs") + 5)
print(df_samename)
df_alias = df.select(
    (pl.col("nrs") + 5).alias("nrs + 5"),
    (pl.col("nrs") - 5).alias("nrs - 5"),
)
print(df_alias)
df_alias = df.select(
    pl.col("names").n_unique().alias("unique"),
    pl.approx_n_unique("names").alias("unique_approx"),
)
print(df_alias)
df_conditional = df.select(
    pl.col("nrs"),
    pl.when(pl.col("nrs") > 2)
    .then(pl.lit(True))
    .otherwise(pl.lit(False))
    .alias("conditional"),
)
print(df_conditional) 

# 5. Conversion

Casting converts the underlying DataType of a column to a new one. Polars uses Arrow to manage the data in memory and relies on the compute kernels of its Rust implementation to do the conversion. Casting is performed with the cast() method.

The cast method includes a strict parameter, which determines how Polars behaves when it encounters a value that cannot be converted from the source DataType to the target DataType. By default, strict=True, which means Polars raises an error notifying the user of the failure and providing details of the values that could not be converted. If instead strict=False, any value that cannot be converted to the target DataType is quietly converted to null, as the next snippet shows.
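
For example (a small sketch with made-up data), a string that cannot be parsed as a number becomes null instead of raising an error:

df_strict = pl.DataFrame({"values": ["1.0", "2.5", "text"]})
out = df_strict.select(pl.col("values").cast(pl.Float64, strict=False))
print(out)  # "text" becomes null instead of raising an error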

df = pl.DataFrame(
    {
        "integers": [1, 2, 3, 4, 5],
        "big_integers": [1, 10000002, 3, 10000004, 10000005],
        "floats": [4.0, 5.0, 6.0, 7.0, 8.0],
        "floats_with_decimal": [4.532, 5.5, 6.5, 7.5, 8.5],
    }
)

print(df)
out = df.select(
    pl.col("integers").cast(pl.Float32).alias("integers_as_floats"),
    pl.col("floats").cast(pl.Int32).alias("floats_as_integers"),
    pl.col("floats_with_decimal")
    .cast(pl.Int32)
    .alias("floats_with_decimal_as_integers"),
)
print(out)
out = df.select(
    pl.col("integers").cast(pl.Int16).alias("integers_smallfootprint"),
    pl.col("floats").cast(pl.Float32).alias("floats_smallfootprint"),
)
print(out)
df = pl.DataFrame(
    {
        "integers": [1, 2, 3, 4, 5],
        "float": [4.0, 5.03, 6.0, 7.0, 8.0],
        "floats_as_string": ["4.0", "5.0", "6.0", "7.0", "8.0"],
    }
)

out = df.select(
    pl.col("integers").cast(pl.Utf8),
    pl.col("float").cast(pl.Utf8),
    pl.col("floats_as_string").cast(pl.Float64),
)
print(out)
df = pl.DataFrame(
    {
        "integers": [-1, 0, 2, 3, 4],
        "floats": [0.0, 1.0, 2.0, 3.0, 4.0],
        "bools": [True, False, True, False, True],
    }
)

out = df.select(pl.col("integers").cast(pl.Boolean), pl.col("floats").cast(pl.Boolean))
print(out)
from datetime import date, datetime

df = pl.DataFrame(
    {
        "date": pl.date_range(date(2022, 1, 1), date(2022, 1, 5), eager=True),
        "datetime": pl.datetime_range(
            datetime(2022, 1, 1), datetime(2022, 1, 5), eager=True
        ),
    }
)

out = df.select(pl.col("date").cast(pl.Int64), pl.col("datetime").cast(pl.Int64))
print(out) 

Strings

df = pl.DataFrame({"animal": ["Crab", "cat and dog", "rabbit", None]})

out = df.select(
    pl.col("animal").str.len_bytes().alias("byte_count"),
    pl.col("animal").str.len_chars().alias("letter_count"),
)
print(out)
out = df.select(
    pl.col("animal"),
    pl.col("animal").str.contains("cat|bit").alias("regex"),
    pl.col("animal").str.contains("rab", literal=True).alias("literal"),
    pl.col("animal").str.starts_with("rab").alias("starts_with"),
    pl.col("animal").str.ends_with("dog").alias("ends_with"),
)
print(out) 
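
The str.replace used earlier on a Series works as an expression too (a brief extra sketch):

out = df.select(
    pl.col("animal"),
    pl.col("animal").str.replace("cat", "feline").alias("replaced"),
)
print(out)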

Aggregation

https://pola-rs.github.io/polars/user-guide/expressions/aggregation/

df = pl.read_csv("heart.csv")
print(df)
q = (
    df.lazy()
    .group_by("sex")
    .agg(
        pl.count(),
        pl.col("cp"),
        pl.first("caa"),
    )
    .sort("count", descending=True)
    .limit(5)
)

out = q.collect()  # keep df intact for the next query
print(out)
q = (
    df.lazy()
    .group_by("sex")
    .agg(
        (pl.col("cp") == 1).sum().alias("anti"),
        (pl.col("cp") == 2).sum().alias("pro"),
    )
    .sort("pro", descending=True)
    .limit(5)
)

out = q.collect()
print(out)

# 6. Missing values

df = pl.DataFrame(
    {
        "value": [1, None],
    },
)
print(df)
null_count_df = df.null_count()
print(null_count_df)
df = pl.DataFrame(
    {
        "col1": [1, 2, 3],
        "col2": [1, None, 3],
    },
)
print(df)
# Fill nulls with a literal value
fill_literal_df = df.with_columns(
    pl.col("col2").fill_null(pl.lit(2)),
)
print(fill_literal_df)

# Fill nulls with the previous non-null value
fill_forward_df = df.with_columns(
    pl.col("col2").fill_null(strategy="forward"),
)
print(fill_forward_df)

# Fill nulls with the column median
fill_median_df = df.with_columns(
    pl.col("col2").fill_null(pl.median("col2")),
)
print(fill_median_df)

# Fill nulls by interpolating between neighboring values
fill_interpolation_df = df.with_columns(
    pl.col("col2").interpolate(),
)
print(fill_interpolation_df)
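
Note that floating-point NaN is distinct from null in Polars: fill_null does not touch NaN values. NaN has its own methods, such as fill_nan (a short sketch mirroring the user guide):

nan_df = pl.DataFrame({"value": [1.0, float("nan"), 3.0]})
print(nan_df.with_columns(pl.col("value").fill_nan(None).alias("value")))  # NaN -> null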

Window functions

https://pola-rs.github.io/polars/user-guide/expressions/window/

!wget https://cdn.coggle.club/Pokemon.csv
!head Pokemon.csv
# then let's load some csv data with information about pokemon
df = pl.read_csv("Pokemon.csv")
print(df.head())
out = df.select(
    "Type 1",
    "Type 2",
    pl.col("Attack").mean().over("Type 1").alias("avg_attack_by_type"),
    pl.col("Defense")
    .mean()
    .over(["Type 1", "Type 2"])
    .alias("avg_defense_by_type_combination"),
    pl.col("Attack").mean().alias("avg_attack"),
)
print(out)
filtered = df.filter(pl.col("Type 2") == "Psychic").select(
    "Name",
    "Type 1",
    "Speed",
)
print(filtered)
out = filtered.with_columns(
    pl.col(["Name", "Speed"]).sort_by("Speed", descending=True).over("Type 1"),
)
print(out) 

Lists and Arrays

weather = pl.DataFrame(
    {
        "station": ["Station " + str(x) for x in range(1, 6)],
        "temperatures": [
            "20 5 5 E1 7 13 19 9 6 20",
            "18 8 16 11 23 E2 8 E2 E2 E2 90 70 40",
            "19 24 E9 16 6 12 10 22",
            "E2 E0 15 7 8 10 E1 24 17 13 6",
            "14 8 E0 16 22 24 E1",
        ],
    }
)
print(weather)
out = weather.with_columns(pl.col("temperatures").str.split(" "))
print(out)
out = weather.with_columns(pl.col("temperatures").str.split(" ")).explode(
    "temperatures"
)
print(out)
out = weather.with_columns(pl.col("temperatures").str.split(" ")).with_columns(
    pl.col("temperatures").list.head(3).alias("top3"),
    pl.col("temperatures").list.slice(-3, 3).alias("bottom_3"),
    pl.col("temperatures").list.len().alias("obs"),
)
print(out) 
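
The string codes (such as E1) mark measurement errors. To compute a per-station mean we can cast the split values inside the list, turning the non-numeric entries into null (a sketch combining list.eval with a non-strict cast):

out = weather.with_columns(pl.col("temperatures").str.split(" ")).with_columns(
    pl.col("temperatures")
    .list.eval(pl.element().cast(pl.Int64, strict=False))  # error codes become null
    .list.mean()
    .alias("mean_temperature")
)
print(out)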

# 7. Transformation

Joins

| Strategy | Description |
| --- | --- |
| inner | Returns rows with matching keys in both frames; non-matching rows in either frame are discarded. |
| left | Returns all rows of the left frame, whether or not a match is found in the right frame; right-frame columns of non-matching rows are filled with null. |
| outer | Returns all rows from both frames; where no match is found in one frame, the columns from the other frame are filled with null. |
| cross | Returns the Cartesian product of the rows of both frames. Duplicate rows are retained; a cross join of frames A and B always has length len(A) × len(B). |
| asof | A left join that matches on the nearest key rather than on an equal key. |
| semi | Returns all rows of the left frame whose join key also occurs in the right frame. |
| anti | Returns all rows of the left frame whose join key does not occur in the right frame. |

df_customers = pl.DataFrame(
    {
        "customer_id": [1, 2, 3],
        "name": ["Alice", "Bob", "Charlie"],
    }
)
print(df_customers)
df_orders = pl.DataFrame(
    {
        "order_id": ["a", "b", "c"],
        "customer_id": [1, 2, 2],
        "amount": [100, 200, 300],
    }
)
print(df_orders)
df = df_customers.join(df_orders, on="customer_id", how="inner")
print(df)

df = df_customers.join(df_orders, on="customer_id", how="left")
print(df)

df = df_customers.join(df_orders, on="customer_id", how="outer")
print(df)

df = df_customers.join(df_orders, on="customer_id", how="cross")
print(df)
df_cars = pl.DataFrame(
    {
        "id": ["a", "b", "c"],
        "make": ["ford", "toyota", "bmw"],
    }
)
print(df_cars)
df_repairs = pl.DataFrame(
    {
        "id": ["c", "c"],
        "cost": [100, 200],
    }
)
print(df_repairs)
df = df_cars.join(df_repairs, on="id", how="inner")
print(df)

df = df_cars.join(df_repairs, on="id", how="semi")
print(df)

df = df_cars.join(df_repairs, on="id", how="anti")
print(df)
df_trades = pl.DataFrame(
    {
        "time": [
            datetime(2020, 1, 1, 9, 1, 0),
            datetime(2020, 1, 1, 9, 1, 0),
            datetime(2020, 1, 1, 9, 3, 0),
            datetime(2020, 1, 1, 9, 6, 0),
        ],
        "stock": ["A", "B", "B", "C"],
        "trade": [101, 299, 301, 500],
    }
)
print(df_trades)

df_quotes = pl.DataFrame(
    {
        "time": [
            datetime(2020, 1, 1, 9, 0, 0),
            datetime(2020, 1, 1, 9, 2, 0),
            datetime(2020, 1, 1, 9, 4, 0),
            datetime(2020, 1, 1, 9, 6, 0),
        ],
        "stock": ["A", "B", "C", "A"],
        "quote": [100, 300, 501, 102],
    }
)

print(df_quotes)
df_asof_join = df_trades.join_asof(df_quotes, on="time", by="stock")
print(df_asof_join)
df_asof_tolerance_join = df_trades.join_asof(
    df_quotes, on="time", by="stock", tolerance="1m"
)
print(df_asof_tolerance_join) 

Concatenation

df_v1 = pl.DataFrame(
    {
        "a": [1],
        "b": [3],
    }
)
df_v2 = pl.DataFrame(
    {
        "a": [2],
        "b": [4],
    }
)
df_vertical_concat = pl.concat(
    [
        df_v1,
        df_v2,
    ],
    how="vertical",
)
print(df_vertical_concat)
df_h1 = pl.DataFrame(
    {
        "l1": [1, 2],
        "l2": [3, 4],
    }
)
df_h2 = pl.DataFrame(
    {
        "r1": [5, 6],
        "r2": [7, 8],
        "r3": [9, 10],
    }
)
df_horizontal_concat = pl.concat(
    [
        df_h1,
        df_h2,
    ],
    how="horizontal",
)
print(df_horizontal_concat) 
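
Besides vertical and horizontal, Polars also supports how="diagonal", which unions the column names and fills the gaps with null (a brief extra sketch):

df_d1 = pl.DataFrame({"a": [1], "b": [3]})
df_d2 = pl.DataFrame({"a": [2], "c": [4]})
df_diagonal_concat = pl.concat([df_d1, df_d2], how="diagonal")
print(df_diagonal_concat)  # columns a, b, c; missing cells become null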

Pivots

df = pl.DataFrame(
    {
        "foo": ["A", "A", "B", "B", "C"],
        "N": [1, 2, 2, 4, 2],
        "bar": ["k", "l", "m", "n", "o"],
    }
)
print(df)
out = df.pivot(index="foo", columns="bar", values="N", aggregate_function="first")
print(out)
# pivot is an eager-only operation because the output schema depends on the
# data, so a lazy pipeline must collect before pivoting (and may go lazy again)
q = (
    df.lazy()
    .collect()
    .pivot(index="foo", columns="bar", values="N", aggregate_function="first")
    .lazy()
)
out = q.collect()
print(out) 

Melts

import polars as pl

df = pl.DataFrame(
    {
        "A": ["a", "b", "a"],
        "B": [1, 3, 5],
        "C": [10, 11, 12],
        "D": [2, 4, 6],
    }
)
print(df)
out = df.melt(id_vars=["A", "B"], value_vars=["C", "D"])
print(out) 