Polars is a high-performance DataFrame library for manipulating structured data, and arguably the most promising challenger to pandas. Its core is written in Rust, and the library also provides a Python interface. Its main features include:
- Fast: written from the ground up in Rust, designed close to the machine and without external dependencies.
- I/O: first-class support for all common data storage layers: local files, cloud storage, and databases.
- Ease of use: write your queries the way they were intended; internally, Polars' query optimizer determines the most efficient way to execute them.
- Out-of-core processing: the streaming API lets you process results without requiring all of the data to be in memory at the same time.
- Parallelism: Polars makes full use of the machine by distributing the workload across available CPU cores without any extra configuration.
- Vectorized query engine: Polars uses Apache Arrow, a columnar data format, to process queries in a vectorized manner and uses SIMD to optimize CPU usage.
User guide: https://pola-rs.github.io/polars/user-guide/
API reference: https://pola-rs.github.io/polars/py-polars/html/reference/io.html
Introduction
Polars aims to be a lightning-fast DataFrame library with the following features:
- Utilizes all available cores on the computer.
- Reduce unnecessary work/memory allocations by optimizing queries.
- Process data sets much larger than available RAM.
- Have a consistent and predictable API.
- Have a strict schema (data types should be known before running the query).
Polars is written in Rust, which gives it C/C++-level performance and full control over the performance-critical parts of the query engine. To that end, Polars puts a lot of effort into:
- Reducing redundant copies.
- Traversing memory caches efficiently.
- Minimizing contention in parallelism.
- Processing data in chunks.
- Reusing memory allocations.
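The effect of the query optimizer can be inspected directly. As a minimal sketch (the data and column names are illustrative), building a lazy query and calling explain() prints the optimized plan before any work is done:
import polars as pl
# build a lazy query; nothing is executed yet
lf = pl.DataFrame({"a": [1, 2, 3], "b": ["x", "y", "z"]}).lazy()
q = lf.filter(pl.col("a") > 1).select("a")
print(q.explain())  # optimized plan: only column "a" is kept and rows are filtered early
print(q.collect())  # execute the optimized plan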
# 1. Basics
Series & DataFrames
A Series is a one-dimensional data structure in which all elements share the same data type (e.g. integer, string). The following snippet shows how to create a simple named Series.
import polars as pl
import numpy as np
s = pl.Series("a", [1, 2, 3, 4, 5])
print(s)
s = pl.Series("a", [1, 2, 3, 4, 5])
print(s.min())
print(s.max())
s = pl.Series("a", ["polar", "bear", "arctic", "polar fox", "polar bear"])
s2 = s.str.replace("polar", "pola")
print(s2)
from datetime import date
start = date(2001, 1, 1)
stop = date(2001, 1, 9)
s = pl.date_range(start, stop, interval="2d", eager=True)
print(s.dt.day())
A DataFrame is a two-dimensional data structure backed by one or more Series, and can be viewed as an abstraction over a collection (e.g. a list) of Series. The operations you can perform on a DataFrame are very similar to what you would write in SQL queries: you can GROUP BY, JOIN, PIVOT, and also define custom functions.
from datetime import datetime
df = pl.DataFrame(
{
"integer": [1, 2, 3, 4, 5],
"date": [
datetime(2022, 1, 1),
datetime(2022, 1, 2),
datetime(2022, 1, 3),
datetime(2022, 1, 4),
datetime(2022, 1, 5),
],
"float": [4.0, 5.0, 6.0, 7.0, 8.0],
}
)
print(df)
print(df.head(3))
print(df.describe())
Reading & writing
import polars as pl
from datetime import datetime
df = pl.DataFrame(
{
"integer": [1, 2, 3],
"date": [
datetime(2022, 1, 1),
datetime(2022, 1, 2),
datetime(2022, 1, 3),
],
"float": [4.0, 5.0, 6.0],
}
)
print(df)
df.write_csv("output.csv")
df_csv = pl.read_csv("output.csv")
print(df_csv)
df_csv = pl.read_csv("output.csv", try_parse_dates=True)
print(df_csv)
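CSV is not the only supported format. As a quick sketch (the file name is illustrative), the same DataFrame can be round-tripped through Parquet, a columnar format that also preserves the schema and data types:
df.write_parquet("output.parquet")
df_parquet = pl.read_parquet("output.parquet")
print(df_parquet)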
Expressions
import polars as pl
# Create a simple DataFrame
data = {'column1': [1, 2, 3],
'column2': ['a', 'b', 'c']}
df = pl.DataFrame(data)
# Select columns using expressions
selected_df = df.select(pl.col("column1"))
# Filter rows using expressions
filtered_df = df.filter(pl.col("column1") > 1)
print(selected_df)
print(filtered_df)
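Expressions can be chained and combined freely inside a context. A minimal sketch (the alias names are illustrative) selecting two derived columns at once:
out = df.select(
    (pl.col("column1") * 2).alias("doubled"),             # elementwise arithmetic
    pl.col("column2").str.to_uppercase().alias("upper"),  # a string expression
)
print(out)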
# 2. Combining DataFrames
df = pl.DataFrame(
{
"a": np.arange(0, 8),
"b": np.random.rand(8),
"d": [1, 2.0, np.NaN, np.NaN, 0, -5, -42, None],
}
)
df2 = pl.DataFrame(
{
"x": np.arange(0, 8),
"y": ["A", "A", "A", "B", "B", "C", "X", "X"],
}
)
# join the two frames on a == x
joined = df.join(df2, left_on="a", right_on="x")
print(joined)
# hstack appends the columns of df2 to df (row counts must match)
stacked = df.hstack(df2)
print(stacked)
# 3. Concepts
Data types
Polars is based entirely on Arrow data types and is backed by Arrow memory arrays. This makes data processing cache-efficient and well suited for inter-process communication. Most data types follow Arrow's implementation exactly, with some exceptions such as Utf8 (actually LargeUtf8), Categorical, and Object (which has limited support). Here are some of the data types:
| Group | Type | Details |
|---|---|---|
| Number | Int8 | 8-bit signed integer. |
| | Int16 | 16-bit signed integer. |
| | Int32 | 32-bit signed integer. |
| | Int64 | 64-bit signed integer. |
| | UInt8 | 8-bit unsigned integer. |
| | UInt16 | 16-bit unsigned integer. |
| | UInt32 | 32-bit unsigned integer. |
| | UInt64 | 64-bit unsigned integer. |
| | Float32 | 32-bit floating point number. |
| | Float64 | 64-bit floating point number. |
| Nested | Struct | Struct array represented as Vec<Series>, for packing multiple/heterogeneous values into a single column. |
| | List | A list array contains a child array with the list values and an offset array (internally it is actually Arrow's LargeList). |
| Time | Date | Date representation, internally represented as the number of days since the UNIX epoch, encoded as a 32-bit signed integer. |
| | Datetime | Datetime representation, internally represented as the number of microseconds since the UNIX epoch, encoded as a 64-bit signed integer. |
| | Duration | A time-delta type, internally represented as microseconds. Created when subtracting Date/Datetime. |
| | Time | Time representation, internally represented as nanoseconds since midnight. |
| Others | Boolean | Boolean type, effectively packed in bits. |
| | Utf8 | String data (internally actually Arrow's LargeUtf8). |
| | Binary | Stores data as bytes. |
| | Object | A data type with limited support; can be any value. |
| | Categorical | A categorical encoding of a set of strings. |
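Every Series and every DataFrame column carries one of these data types, and the schema can be set and inspected explicitly. A small sketch (the values are illustrative):
# request a specific dtype when constructing a Series
s = pl.Series("ints", [1, 2, 3], dtype=pl.Int16)
print(s.dtype)
# the schema maps column names to data types
df = pl.DataFrame({"a": [1.0, 2.0], "b": ["x", "y"]})
print(df.schema)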
Contexts
Polars has developed its own domain-specific language (DSL) for data transformation. The language is very easy to use and allows complex queries while remaining human readable. Its two core components are contexts and expressions; expressions are covered in a later section.
As the name suggests, context refers to the context in which an expression needs to be evaluated. There are three main contexts[1]:
- Selection: df.select([..]), df.with_columns([..])
- Filtering: df.filter()
- Group by / aggregation: df.group_by(..).agg([..])
df = pl.DataFrame(
{
"nrs": [1, 2, 3, None, 5],
"names": ["foo", "ham", "spam", "egg", None],
"random": np.random.rand(5),
"groups": ["A", "A", "B", "C", "B"],
}
)
print(df)
# select: compute new columns from df and return a new DataFrame
out = df.select(
pl.sum("nrs"), # sum of nrs
pl.col("names").sort(), # names, sorted
pl.col("names").first().alias("first name"), # first value of names
(pl.mean("nrs") * 10).alias("10xnrs"), # mean of nrs * 10
)
)
print(out)
# with_columns: add new columns to the original df
df = df.with_columns(
pl.sum("nrs").alias("nrs_sum"),
pl.col("random").count().alias("count"),
)
print(df)
out = df.filter(pl.col("nrs") > 2)
print(out)
out = df.group_by("groups").agg(
pl.sum("nrs"), # sum nrs by groups
pl.col("random").count().alias("count"), # count group members
# sum random where name != null
pl.col("random").filter(pl.col("names").is_not_null()).sum().name.suffix("_sum"),
pl.col("names").reverse().alias("reversed names"),
)
print(out)
Lazy / eager API
Polars supports two operating modes: lazy and eager. In eager API, queries are executed immediately, whereas in lazy API, queries are evaluated only when “needed”.
!wget https://mirror.coggle.club/dataset/heart.csv
!head heart.csv
df = pl.read_csv("heart.csv")
df_small = df.filter(pl.col("age") > 5)
df_agg = df_small.group_by("sex").agg(pl.col("chol").mean())
print(df_agg)
q = (
pl.scan_csv("heart.csv")
.filter(pl.col("age") > 5)
.group_by("sex")
.agg(pl.col("chol").mean())
)
# q only holds the query plan at this point; nothing has been executed yet
df = q.collect()  # collect() triggers the actual computation
print(df)
Streaming API
https://pola-rs.github.io/polars/user-guide/concepts/streaming/
Another added benefit of the Lazy API is that it allows queries to be executed in a streaming manner. Instead of processing all data at once, Polars can execute queries in batches, allowing you to process data sets larger than memory.
q = (
pl.scan_csv("heart.csv")
.filter(pl.col("age") > 5)
.group_by("sex")
.agg(pl.col("chol").mean())
)
df = q.collect(streaming=True)
print(df)
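When even the aggregated result should not be materialized in memory, a streaming query can write its output directly to disk. A sketch using sink_parquet on the same query q defined above (the output path is illustrative):
q.sink_parquet("heart_agg.parquet")  # execute the query in streaming mode and write the result to Parquet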
# 4. Expressions
Basic operators
df = pl.DataFrame(
{
"nrs": [1, 2, 3, None, 5],
"names": ["foo", "ham", "spam", "egg", None],
"random": np.random.rand(5),
"groups": ["A", "A", "B", "C", "B"],
}
)
print(df)
df_numerical = df.select(
(pl.col("nrs") + 5).alias("nrs + 5"),
(pl.col("nrs") - 5).alias("nrs - 5"),
(pl.col("nrs") * pl.col("random")).alias("nrs * random"),
(pl.col("nrs") / pl.col("random")).alias("nrs / random"),
)
print(df_numerical)
df_logical = df.select(
(pl.col("nrs") > 1).alias("nrs > 1"),
(pl.col("random") <= 0.5).alias("random <= .5"),
(pl.col("nrs") != 1).alias("nrs != 1"),
(pl.col("nrs") == 1).alias("nrs == 1"),
((pl.col("random") <= 0.5) & (pl.col("nrs") > 1)).alias("and_expr"), # and
((pl.col("random") <= 0.5) | (pl.col("nrs") > 1)).alias("or_expr"), # or
)
print(df_logical)
Column selections
from datetime import date, datetime
df = pl.DataFrame(
{
"id": [9, 4, 2],
"place": ["Mars", "Earth", "Saturn"],
"date": pl.date_range(date(2022, 1, 1), date(2022, 1, 3), "1d", eager=True),
"sales": [33.4, 2142134.1, 44.7],
"has_people": [False, True, False],
"logged_at": pl.datetime_range(
datetime(2022, 12, 1), datetime(2022, 12, 1, 0, 0, 2), "1s", eager=True
),
}
).with_row_count("rn")
print(df)
out = df.select(pl.col("*"))
# Is equivalent to
out = df.select(pl.all())
print(out)
out = df.select(pl.col("*").exclude("logged_at", "rn"))
print(out)
out = df.select(pl.col("date", "logged_at").dt.to_string("%Y-%h-%d"))
print(out)
out = df.select(pl.col("^.*(as|sa).*$"))
print(out)
out = df.select(pl.col(pl.Int64, pl.UInt32, pl.Boolean).n_unique())
print(out)
import polars.selectors as cs
# selectors support set operations: numeric columns except the first column
out = df.select(cs.numeric() - cs.first())
print(out)
# columns whose name contains "rn", plus columns matching the regex ".*_.*"
out = df.select(cs.contains("rn"), cs.matches(".*_.*"))
print(out)
Functions
df = pl.DataFrame(
{
"nrs": [1, 2, 3, None, 5],
"names": ["foo", "ham", "spam", "egg", "spam"],
"random": np.random.rand(5),
"groups": ["A", "A", "B", "C", "B"],
}
)
print(df)
df_samename = df.select(pl.col("nrs") + 5)
print(df_samename)
df_alias = df.select(
(pl.col("nrs") + 5).alias("nrs + 5"),
(pl.col("nrs") - 5).alias("nrs - 5"),
)
print(df_alias)
df_alias = df.select(
pl.col("names").n_unique().alias("unique"),
pl.approx_n_unique("names").alias("unique_approx"),
)
print(df_alias)
df_conditional = df.select(
pl.col("nrs"),
pl.when(pl.col("nrs") > 2)
.then(pl.lit(True))
.otherwise(pl.lit(False))
.alias("conditional"),
)
print(df_conditional)
# 5. Conversion
Type conversion (casting) converts a column's underlying DataType to a new data type. Polars uses Arrow to manage the data in memory and relies on compute kernels in its Rust implementation to perform the conversion. Casting is done through the cast() method.
The cast method has a strict parameter that determines how Polars behaves when it encounters a value that cannot be converted from the source DataType to the target DataType. By default, strict=True, which means Polars raises an error notifying the user of the conversion failure and the values that could not be converted. With strict=False, any value that cannot be converted to the target DataType is silently converted to null.
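The examples below all use the default strict casts. As a small sketch of the non-strict behaviour (the column name is illustrative), values that cannot be parsed become null instead of raising an error:
df_strict = pl.DataFrame({"strings": ["1", "2", "not_a_number"]})
out = df_strict.select(pl.col("strings").cast(pl.Int64, strict=False))
print(out)  # "not_a_number" is silently converted to null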
df = pl.DataFrame(
{
"integers": [1, 2, 3, 4, 5],
"big_integers": [1, 10000002, 3, 10000004, 10000005],
"floats": [4.0, 5.0, 6.0, 7.0, 8.0],
"floats_with_decimal": [4.532, 5.5, 6.5, 7.5, 8.5],
}
)
print(df)
out = df.select(
pl.col("integers").cast(pl.Float32).alias("integers_as_floats"),
pl.col("floats").cast(pl.Int32).alias("floats_as_integers"),
pl.col("floats_with_decimal")
.cast(pl.Int32)
.alias("floats_with_decimal_as_integers"),
)
print(out)
out = df.select(
pl.col("integers").cast(pl.Int16).alias("integers_smallfootprint"),
pl.col("floats").cast(pl.Float32).alias("floats_smallfootprint"),
)
print(out)
df = pl.DataFrame(
{
"integers": [1, 2, 3, 4, 5],
"float": [4.0, 5.03, 6.0, 7.0, 8.0],
"floats_as_string": ["4.0", "5.0", "6.0", "7.0", "8.0"],
}
)
out = df.select(
pl.col("integers").cast(pl.Utf8),
pl.col("float").cast(pl.Utf8),
pl.col("floats_as_string").cast(pl.Float64),
)
print(out)
df = pl.DataFrame(
{
"integers": [-1, 0, 2, 3, 4],
"floats": [0.0, 1.0, 2.0, 3.0, 4.0],
"bools": [True, False, True, False, True],
}
)
out = df.select(pl.col("integers").cast(pl.Boolean), pl.col("floats").cast(pl.Boolean))
print(out)
from datetime import date, datetime
df = pl.DataFrame(
{
"date": pl.date_range(date(2022, 1, 1), date(2022, 1, 5), eager=True),
"datetime": pl.datetime_range(
datetime(2022, 1, 1), datetime(2022, 1, 5), eager=True
),
}
)
out = df.select(pl.col("date").cast(pl.Int64), pl.col("datetime").cast(pl.Int64))
print(out)
Strings
df = pl.DataFrame({"animal": ["Crab", "cat and dog", "rabbit", None]})
out = df.select(
pl.col("animal").str.len_bytes().alias("byte_count"),
pl.col("animal").str.len_chars().alias("letter_count"),
)
print(out)
out = df.select(
pl.col("animal"),
pl.col("animal").str.contains("cat|bit").alias("regex"),
pl.col("animal").str.contains("rab", literal=True).alias("literal"),
pl.col("animal").str.starts_with("rab").alias("starts_with"),
pl.col("animal").str.ends_with("dog").alias("ends_with"),
)
print(out)
Aggregation
https://pola-rs.github.io/polars/user-guide/expressions/aggregation/
df = pl.read_csv("heart.csv")
print(df)
q = (
df.lazy()
.group_by("sex")
.agg(
pl.count(),
pl.col("cp"),
pl.first("caa"),
)
.sort("count", descending=True)
.limit(5)
)
out = q.collect()
print(out)
q = (
df.lazy()
.group_by("sex")
.agg(
(pl.col("cp") == 1).sum().alias("anti"),
(pl.col("cp") == 2).sum().alias("pro"),
)
.sort("pro", descending=True)
.limit(5)
)
out = q.collect()
print(out)
# 6. Missing values
df = pl.DataFrame(
{
"value": [1, None],
},
)
print(df)
null_count_df = df.null_count()
print(null_count_df)
df = pl.DataFrame(
{
"col1": [1, 2, 3],
"col2": [1, None, 3],
},
)
print(df)
fill_literal_df = df.with_columns(
pl.col("col2").fill_null(pl.lit(2)),
)
print(fill_literal_df)
fill_forward_df = df.with_columns(
pl.col("col2").fill_null(strategy="forward"),
)
print(fill_forward_df)
fill_median_df = df.with_columns(
pl.col("col2").fill_null(pl.median("col2")),
)
print(fill_median_df)
fill_interpolation_df = df.with_columns(
pl.col("col2").interpolate(),
)
print(fill_interpolation_df)
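Note that null (missing data) is distinct from the floating-point NaN value, which Polars treats as a regular Float value. A small sketch (the column name is illustrative) converting NaN to null with fill_nan so that the fill strategies above can be applied:
nan_df = pl.DataFrame({"value": [1.0, float("nan"), 3.0]})
out = nan_df.with_columns(pl.col("value").fill_nan(None).alias("value_no_nan"))
print(out)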
Window functions
https://pola-rs.github.io/polars/user-guide/expressions/window/
!wget https://cdn.coggle.club/Pokemon.csv
!head Pokemon.csv
# then let's load some csv data with information about pokemon
df = pl.read_csv("Pokemon.csv")
print(df.head())
out = df.select(
"Type 1",
"Type 2",
pl.col("Attack").mean().over("Type 1").alias("avg_attack_by_type"),
pl.col("Defense")
.mean()
.over(["Type 1", "Type 2"])
.alias("avg_defense_by_type_combination"),
pl.col("Attack").mean().alias("avg_attack"),
)
print(out)
filtered = df.filter(pl.col("Type 2") == "Psychic").select(
"Name",
"Type 1",
"Speed",
)
print(filtered)
out = filtered.with_columns(
pl.col(["Name", "Speed"]).sort_by("Speed", descending=True).over("Type 1"),
)
print(out)
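Window functions work with any expression, not just mean. As a sketch (the alias is illustrative), rank can be evaluated per group to rank each Pokemon's Speed within its Type 1:
out = filtered.with_columns(
    pl.col("Speed").rank(descending=True).over("Type 1").alias("speed_rank_in_type"),
)
print(out)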
Lists and Arrays
weather = pl.DataFrame(
{
"station": ["Station " + str(x) for x in range(1, 6)],
"temperatures": [
"20 5 5 E1 7 13 19 9 6 20",
"18 8 16 11 23 E2 8 E2 E2 E2 90 70 40",
"19 24 E9 16 6 12 10 22",
"E2 E0 15 7 8 10 E1 24 17 13 6",
"14 8 E0 16 22 24 E1",
],
}
)
print(weather)
out = weather.with_columns(pl.col("temperatures").str.split(" "))
print(out)
out = weather.with_columns(pl.col("temperatures").str.split(" ")).explode(
"temperatures"
)
print(out)
out = weather.with_columns(pl.col("temperatures").str.split(" ")).with_columns(
pl.col("temperatures").list.head(3).alias("top3"),
pl.col("temperatures").list.slice(-3, 3).alias("bottom_3"),
pl.col("temperatures").list.len().alias("obs"),
)
print(out)
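Individual list elements can be transformed with list.eval and pl.element(). A sketch (assuming the E* codes should be treated as missing) that computes each station's mean temperature:
out = weather.with_columns(pl.col("temperatures").str.split(" ")).with_columns(
    pl.col("temperatures")
    .list.eval(pl.element().cast(pl.Int64, strict=False))  # non-numeric codes become null
    .list.mean()
    .alias("mean_temperature")
)
print(out)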
# 7. Transformation
Joins
| Strategy | Description |
|---|---|
| inner | Returns rows with matching keys in both frames. Non-matching rows in the left or right frame are discarded. |
| left | Returns all rows in the left frame, whether or not a match is found in the right frame. Right-frame columns of non-matching rows are filled with null. |
| outer | Returns all rows from both frames. If no match is found in one frame, the columns from the other frame are filled with null. |
| cross | Returns the Cartesian product of all rows in the left frame with all rows in the right frame. Duplicate rows are retained; a cross join of the left and right frames always has len(A) × len(B) rows. |
| asof | A left join where the match is performed on the nearest key rather than on equal keys. |
| semi | Returns all rows in the left frame whose join key also appears in the right frame. |
| anti | Returns all rows in the left frame whose join key does not appear in the right frame. |
df_customers = pl.DataFrame(
{
"customer_id": [1, 2, 3],
"name": ["Alice", "Bob", "Charlie"],
}
)
print(df_customers)
df_orders = pl.DataFrame(
{
"order_id": ["a", "b", "c"],
"customer_id": [1, 2, 2],
"amount": [100, 200, 300],
}
)
print(df_orders)
df = df_customers.join(df_orders, on="customer_id", how="inner")
print(df)
df = df_customers.join(df_orders, on="customer_id", how="left")
print(df)
df = df_customers.join(df_orders, on="customer_id", how="outer")
print(df)
df = df_customers.join(df_orders, on="customer_id", how="cross")
print(df)
df_cars = pl.DataFrame(
{
"id": ["a", "b", "c"],
"make": ["ford", "toyota", "bmw"],
}
)
print(df_cars)
df_repairs = pl.DataFrame(
{
"id": ["c", "c"],
"cost": [100, 200],
}
)
print(df_repairs)
df = df_cars.join(df_repairs, on="id", how="inner")
print(df)
df = df_cars.join(df_repairs, on="id", how="semi")
print(df)
df = df_cars.join(df_repairs, on="id", how="anti")
print(df)
df_trades = pl.DataFrame(
{
"time": [
datetime(2020, 1, 1, 9, 1, 0),
datetime(2020, 1, 1, 9, 1, 0),
datetime(2020, 1, 1, 9, 3, 0),
datetime(2020, 1, 1, 9, 6, 0),
],
"stock": ["A", "B", "B", "C"],
"trade": [101, 299, 301, 500],
}
)
print(df_trades)
df_quotes = pl.DataFrame(
{
"time": [
datetime(2020, 1, 1, 9, 0, 0),
datetime(2020, 1, 1, 9, 2, 0),
datetime(2020, 1, 1, 9, 4, 0),
datetime(2020, 1, 1, 9, 6, 0),
],
"stock": ["A", "B", "C", "A"],
"quote": [100, 300, 501, 102],
}
)
print(df_quotes)
df_asof_join = df_trades.join_asof(df_quotes, on="time", by="stock")
print(df_asof_join)
df_asof_tolerance_join = df_trades.join_asof(
df_quotes, on="time", by="stock", tolerance="1m"
)
print(df_asof_tolerance_join)
Concatenation
df_v1 = pl.DataFrame(
{
"a": [1],
"b": [3],
}
)
df_v2 = pl.DataFrame(
{
"a": [2],
"b": [4],
}
)
df_vertical_concat = pl.concat(
[
df_v1,
df_v2,
],
how="vertical",
)
print(df_vertical_concat)
df_h1 = pl.DataFrame(
{
"l1": [1, 2],
"l2": [3, 4],
}
)
df_h2 = pl.DataFrame(
{
"r1": [5, 6],
"r2": [7, 8],
"r3": [9, 10],
}
)
df_horizontal_concat = pl.concat(
[
df_h1,
df_h2,
],
how="horizontal",
)
print(df_horizontal_concat)
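Besides vertical and horizontal, pl.concat also supports a diagonal strategy, which unions frames with different schemas and fills missing columns with null. A minimal sketch:
df_d1 = pl.DataFrame({"a": [1], "b": [3]})
df_d2 = pl.DataFrame({"a": [2], "c": [4]})
df_diagonal_concat = pl.concat([df_d1, df_d2], how="diagonal")
print(df_diagonal_concat)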
Pivots
df = pl.DataFrame(
{
"foo": ["A", "A", "B", "B", "C"],
"N": [1, 2, 2, 4, 2],
"bar": ["k", "l", "m", "n", "o"],
}
)
print(df)
out = df.pivot(index="foo", columns="bar", values="N", aggregate_function="first")
print(out)
# pivot is only available in the eager API, so a lazy frame must be collected first
q = (
df.lazy()
.collect()
.pivot(index="foo", columns="bar", values="N", aggregate_function="first")
.lazy()
)
out = q.collect()
print(out)
Melts
import polars as pl
df = pl.DataFrame(
{
"A": ["a", "b", "a"],
"B": [1, 3, 5],
"C": [10, 11, 12],
"D": [2, 4, 6],
}
)
print(df)
out = df.melt(id_vars=["A", "B"], value_vars=["C", "D"])
print(out)