Emil Sadek

Data Validation for DuckDB

August 28, 2025

If you’re a data analyst, scientist, or engineer, you’ve likely used DuckDB in some capacity. It’s an impressively powerful tool that has rightfully gained significant popularity in recent years. While DuckDB simplifies data processing, working with data often means encountering data quality issues. The 2025 State of Analytics Engineering Report from dbt Labs found that “data quality remains the most critical challenge for data teams to solve.” This raises an important question: How can we ensure data quality when using DuckDB for data workflows?

Pointblank is a powerful, open-source data validation framework for Python. With its intuitive, chainable API and interactive reports, Pointblank makes it easy to define comprehensive data quality checks and visualize the results. DuckDB and other database systems are supported via Pointblank’s integration with Ibis, a portable Python dataframe library that works with many different backends.

Install pointblank with the duckdb and pl (Polars) extras:

pip install "pointblank[duckdb,pl]"

Polars is used for rendering report results.

Import ibis and pointblank:

import ibis
import pointblank as pb

Connect to DuckDB:

con = ibis.duckdb.connect()

This connects to an in-memory database. If you want to connect to a DuckDB database file or MotherDuck, see the Ibis DuckDB documentation.
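
For example, connecting to a database file instead of an in-memory database looks like this (the file path here is illustrative):

con = ibis.duckdb.connect("my_database.duckdb")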

Create a table:

con.raw_sql("""
CREATE TABLE titanic AS
    SELECT * REPLACE (Survived::BOOLEAN AS Survived)
    FROM 'https://raw.githubusercontent.com/datasciencedojo/datasets/refs/heads/master/titanic.csv'
""")

We’ll use the famous Titanic dataset in this walkthrough. The REPLACE clause in the SELECT statement casts the Survived column from an integer to a boolean.

Create a table expression:

titanic = con.table("titanic")

This table expression is used by Pointblank to query DuckDB.
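
The expression is lazy: no query runs in DuckDB until it is executed. As a quick sanity check, you can count the rows yourself:

print(titanic.count().execute())  # 891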

Preview the table:

pb.preview(titanic)
[Preview output: a DuckDB table, 891 rows, 12 columns. Columns and types: PassengerId (int64), Survived (boolean), Pclass (int64), Name (string), Sex (string), Age (float64), SibSp (int64), Parch (int64), Ticket (string), Fare (float64), Cabin (string), Embarked (string). The preview shows the first and last five rows of the table.]

The preview shows the table type and dimensions, column names and types, as well as the first and last five rows. Previewing is useful for gaining a preliminary understanding of the dataset before diving into validation.
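
If five rows at each end is not enough, you can show more. A small sketch, assuming the n_head and n_tail parameters from Pointblank's preview API:

pb.preview(titanic, n_head=10, n_tail=10)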

To improve our understanding, it can also be useful to inspect column-level summary statistics.

Summarize the table:

pb.col_summary_tbl(titanic)
[Column summary output: a DuckDB table, 891 rows, 12 columns. For each column, the report lists NA, UQ, Mean, SD, Min, P5, Q1, Med, Q3, P95, Max, and IQR. Highlights: Age is missing 177 values (20%) and Cabin 687 (77%), Embarked is missing 2 values, Sex has 2 unique values, and Ticket has 681 unique values (76%).]

The column summary table, like the preview, provides the table type and dimensions, as well as column names and types. For each column we also get the missing count/proportion (NA), unique value count/proportion (UQ), mean (Mean), standard deviation (SD), minimum (Min), 5th percentile (P5), first quartile (Q1), median (Med), third quartile (Q3), 95th percentile (P95), maximum (Max), and interquartile range (IQR: Q3 - Q1).
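
You can also cross-check these statistics directly against DuckDB through the same Ibis expression. A quick sketch using Ibis column methods:

age = titanic["Age"]
print(age.isnull().sum().execute())  # missing values in Age: 177
print(age.mean().execute())          # mean Age: roughly 29.7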

After previewing and summarizing the table, you should have a decent understanding of the dataset and an idea of which data quality checks to implement.

Create a validation:

validation = (
    pb.Validate(
        data=titanic,
        tbl_name="titanic",                                  # Name of the table for reporting
        label="Example titanic dataset validation",          # Label for the validation, appears in reports
        thresholds=(0.01, 0.02, 0.05),                       # Thresholds for warnings, errors, and critical issues
        brief=True,                                          # Add automatically-generated briefs for each step
    )
    .col_vals_not_null(columns="PassengerId")                # Validate values not Null
    .col_exists(columns=["Name", "Ticket"])                  # Validate columns exist
    .col_vals_between(columns="Pclass", left=1, right=3)     # Validate 1 <= values <= 3
    .col_vals_in_set(columns="Sex", set=["male", "female"])  # Validate values "male" or "female"
    .col_vals_lt(columns="Age", value=60, na_pass=True)      # Validate values < 60, Null allowed
    .interrogate()                                           # Execute and collect results
)

The Validate class’s data parameter takes the database table (expression) to be validated. The chained methods starting with col_ specify validation steps run on specific columns. Finally, the interrogate method executes the validation plan.
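
Once interrogated, the results can also be queried programmatically. A minimal sketch, assuming Pointblank's result accessor methods:

print(validation.all_passed())   # True only if every step fully passed
print(validation.n_failed(i=6))  # failing test units in step 6, the Age check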

Get the validation report from a notebook:

validation

Or from the Python REPL:

validation.get_tabular_report().show()
[Validation report: "Example titanic dataset validation" on the DuckDB table titanic, with thresholds WARNING 0.01, ERROR 0.02, CRITICAL 0.05. Steps 1-5 (col_vals_not_null on PassengerId, col_exists for Name and Ticket, col_vals_between on Pclass, col_vals_in_set on Sex) pass on all test units. Step 6 (col_vals_lt: values in Age should be < 60) has 865 of 891 units passing (0.97) and 26 failing (0.03), which exceeds the warning and error thresholds but not the critical one. The interrogation completed in under one second.]

The header provides the validation label, table type and name, and thresholds. For each step, the validation rules include the assertion type and brief (STEP), column name (COLUMNS), and applicable values (VALUES). The validation results include the test unit count (UNITS), passing count/proportion (PASS), failing count/proportion (FAIL), and indicators showing whether the warning (W), error (E), and critical (C) thresholds were exceeded. The final extracts (EXT) column is not yet supported for DuckDB.

The validation results report can also be retrieved as a JSON-formatted string or Python dictionary:

import json

json_report = validation.get_json_report()
dict_report = json.loads(json_report)

This is especially useful in data pipelines and scheduled data quality checks. You can log the report, send a Slack message, continue or fail the pipeline run, etc.
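
For example, here is a minimal sketch of failing a pipeline run on a bad validation, assuming the all_passed method and reusing dict_report from above:

if not validation.all_passed():
    print(json.dumps(dict_report, indent=2))  # log the full report first
    raise RuntimeError("Data validation failed")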

Pointblank has many other great features that I haven’t mentioned here. To learn more, take a look at the Pointblank user guide.

Thanks for reading and don’t forget to validate your data!