Emil Sadek

Data Validation for DuckDB

August 28, 2025

If you’re a data analyst, scientist, or engineer, you’ve likely used DuckDB in some capacity. It’s an impressively powerful tool that has rightfully gained significant popularity in recent years. While DuckDB simplifies data processing, working with data often means encountering data quality issues. The 2025 State of Analytics Engineering Report from dbt Labs found that “data quality remains the most critical challenge for data teams to solve.” This raises an important question: How can we ensure data quality when using DuckDB for data workflows?

Pointblank is a powerful, open-source data validation framework for Python. With its intuitive, chainable API and interactive reports, Pointblank makes it easy to define comprehensive data quality checks and visualize the results. DuckDB and other database systems are supported via Pointblank’s integration with Ibis, a portable Python dataframe library that works with many different backends.

Install pointblank with the duckdb and pl (Polars) extras:

pip install "pointblank[duckdb,pl]"

Polars is used for rendering report results.

Import ibis and pointblank:

import ibis
import pointblank as pb

Connect to DuckDB:

con = ibis.duckdb.connect()

This connects to an in-memory database. If you want to connect to a DuckDB database file or MotherDuck, see the Ibis DuckDB documentation.
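
For example, connecting to a database file instead of an in-memory database looks like this (the file path here is illustrative):

con = ibis.duckdb.connect("my_database.duckdb")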

Create a table:

con.raw_sql("""
CREATE TABLE titanic AS
    SELECT * REPLACE (Survived::BOOLEAN AS Survived)
    FROM 'https://raw.githubusercontent.com/datasciencedojo/datasets/refs/heads/master/titanic.csv'
""")

We’ll use the famous Titanic dataset in this walkthrough. The REPLACE clause in the SELECT statement casts the Survived column from an integer to a boolean.

Create a table expression:

titanic = con.table("titanic")

This table expression is used by Pointblank to query DuckDB.
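
The expression is lazy: no query runs in DuckDB until it is executed. As a quick sanity check, you can count the rows yourself:

print(titanic.count().execute())  # 891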

Preview the table:

pb.preview(titanic)
[Preview output: a DuckDB table, 891 rows, 12 columns. Columns and types: PassengerId (int64), Survived (boolean), Pclass (int64), Name (string), Sex (string), Age (float64), SibSp (int64), Parch (int64), Ticket (string), Fare (float64), Cabin (string), Embarked (string). The preview shows the first and last five rows of the table.]

The preview shows the table type and dimensions, column names and types, as well as the first and last five rows. Previewing is useful for gaining a preliminary understanding of the dataset before diving into validation.
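
If five rows at each end is not enough, you can show more. A small sketch, assuming the n_head and n_tail parameters from Pointblank's preview API:

pb.preview(titanic, n_head=10, n_tail=10)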

To improve our understanding, it can also be useful to inspect column-level summary statistics.

Summarize the table:

pb.col_summary_tbl(titanic)
[Column summary output: a DuckDB table, 891 rows, 12 columns. For each column, the report lists NA, UQ, Mean, SD, Min, P5, Q1, Med, Q3, P95, Max, and IQR. Highlights: Age is missing 177 values (20%) and Cabin 687 (77%), Embarked is missing 2 values, Sex has 2 unique values, and Ticket has 681 unique values (76%).]

The column summary table, like the preview, provides the table type and dimensions, as well as column names and types. For each column we also get the missing count/proportion (NA), unique value count/proportion (UQ), mean (Mean), standard deviation (SD), minimum (Min), 5th percentile (P5), first quartile (Q1), median (Med), third quartile (Q3), 95th percentile (P95), maximum (Max), and interquartile range (IQR: Q3 - Q1).
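
You can also cross-check these statistics directly against DuckDB through the same Ibis expression. A quick sketch using Ibis column methods:

age = titanic["Age"]
print(age.isnull().sum().execute())  # missing values in Age: 177
print(age.mean().execute())          # mean Age: roughly 29.7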

After previewing and summarizing the table, you should have a decent understanding of the dataset and an idea of which data quality checks to implement.

Create a validation:

validation = (
    pb.Validate(
        data=titanic,
        tbl_name="titanic",                                  # Name of the table for reporting
        label="Example titanic dataset validation",          # Label for the validation, appears in reports
        thresholds=(0.01, 0.02, 0.05),                       # Thresholds for warnings, errors, and critical issues
        brief=True,                                          # Add automatically-generated briefs for each step
    )
    .col_vals_not_null(columns="PassengerId")                # Validate values not Null
    .col_exists(columns=["Name", "Ticket"])                  # Validate columns exist
    .col_vals_between(columns="Pclass", left=1, right=3)     # Validate 1 <= values <= 3
    .col_vals_in_set(columns="Sex", set=["male", "female"])  # Validate values "male" or "female"
    .col_vals_lt(columns="Age", value=60, na_pass=True)      # Validate values < 60, Null allowed
    .interrogate()                                           # Execute and collect results
)

The Validate class’s data parameter takes the database table (expression) to be validated. The chained methods starting with col_ specify validation steps run on specific columns. Finally, the interrogate method executes the validation plan.
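
Once interrogated, the results can also be queried programmatically. A minimal sketch, assuming Pointblank's result accessor methods:

print(validation.all_passed())   # True only if every step fully passed
print(validation.n_failed(i=6))  # failing test units in step 6, the Age check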

Get the validation report from a notebook:

validation

Or from the Python REPL:

validation.get_tabular_report().show()
[Validation report: "Example titanic dataset validation" on the DuckDB table titanic, with thresholds WARNING 0.01, ERROR 0.02, CRITICAL 0.05. Steps 1-5 (col_vals_not_null on PassengerId, col_exists for Name and Ticket, col_vals_between on Pclass, col_vals_in_set on Sex) pass on all test units. Step 6 (col_vals_lt: values in Age should be < 60) has 865 of 891 units passing (0.97) and 26 failing (0.03), which exceeds the warning and error thresholds but not the critical one. The interrogation completed in under one second.]

The header provides the validation label, table type and name, and thresholds. For each step, the validation rules include the assertion type and brief (STEP), column name (COLUMNS), and applicable values (VALUES). The validation results include the test unit count (UNITS), passing count/proportion (PASS), failing count/proportion (FAIL), and indicators showing whether the warning (W), error (E), and critical (C) thresholds were exceeded. The final extracts (EXT) column is not yet supported for DuckDB.

The validation results report can also be retrieved as a JSON-formatted string or Python dictionary:

import json

json_report = validation.get_json_report()
dict_report = json.loads(json_report)

This is especially useful in data pipelines and scheduled data quality checks. You can log the report, send a Slack message, continue or fail the pipeline run, etc.
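
For example, here is a minimal sketch of failing a pipeline run on a bad validation, assuming the all_passed method and reusing dict_report from above:

if not validation.all_passed():
    print(json.dumps(dict_report, indent=2))  # log the full report first
    raise RuntimeError("Data validation failed")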

Pointblank has many other great features that I haven’t mentioned here. To learn more, take a look at the Pointblank user guide.

Thanks for reading and don’t forget to validate your data!