August 28, 2025
If you’re a data analyst, scientist, or engineer, you’ve likely used DuckDB in some capacity. It’s an impressively powerful tool that has rightfully gained significant popularity in recent years. While DuckDB simplifies data processing, working with data often means encountering data quality issues. The 2025 State of Analytics Engineering Report from dbt Labs found that “data quality remains the most critical challenge for data teams to solve.” This raises an important question: How can we ensure data quality when using DuckDB for data workflows?
Pointblank is a powerful, open-source data validation framework for Python. With its intuitive, chainable API and interactive reports, Pointblank makes it easy to define comprehensive data quality checks and visualize the results. DuckDB and other database systems are supported via Pointblank’s integration with Ibis, a portable Python dataframe library that works with many different backends.
Install pointblank with the duckdb and pl (Polars) extras:
pip install "pointblank[duckdb,pl]"
Polars is used for rendering reporting results.
Import ibis and pointblank:
import ibis
import pointblank as pb
Connect to DuckDB:
con = ibis.duckdb.connect()
This connects to an in-memory database. If you want to connect to a DuckDB database file or MotherDuck, see the Ibis DuckDB documentation.
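For example, a connection to a local database file looks like this (the file name here is hypothetical):
con = ibis.duckdb.connect("analytics.ddb")  # connect to a DuckDB database file instead of in-memory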
Create a table:
con.raw_sql("""
CREATE TABLE titanic AS
SELECT * REPLACE (Survived::BOOLEAN AS Survived)
FROM 'https://raw.githubusercontent.com/datasciencedojo/datasets/refs/heads/master/titanic.csv'
""")
We’ll use the famous Titanic dataset in this walkthrough.
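As an aside, the CSV could also be loaded without writing raw SQL, using the DuckDB backend's read_csv method (a sketch; note that it skips the Survived boolean cast applied in the SQL above):
titanic_raw = con.read_csv(
    "https://raw.githubusercontent.com/datasciencedojo/datasets/refs/heads/master/titanic.csv",
    table_name="titanic",
)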
Create a table expression:
titanic = con.table("titanic")
This table expression is used by Pointblank to query DuckDB.
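Because this is a lazy Ibis expression, you can also transform it before validating. For instance, to validate only adult passengers (a hypothetical refinement):
adults = titanic.filter(titanic.Age >= 18)  # pass this filtered expression to Pointblank instead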
Preview the table:
pb.preview(titanic)
(Preview output: DuckDB table, 891 rows, 12 columns, showing the first and last five rows.)
The preview shows the table type and dimensions, column names and types, as well as the first and last five rows. Previewing is useful for gaining a preliminary understanding of the dataset before diving into validation.
To improve our understanding, it can also be useful to inspect column-level summary statistics.
Summarize the table:
pb.col_summary_tbl(titanic)
(Column summary output: DuckDB table, 891 rows, 12 columns.)

| Column | NA | UQ | Mean | SD | Min | P5 | Q1 | Med | Q3 | P95 | Max | IQR |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 (int64) | 0 / 0.00 | 891 / 1.00 | 446 | 257 | 1.00 | 45.0 | 223 | 446 | 669 | 847 | 891 | 445 |
| 2 (boolean) | 0 / 0.00 | T 0.38, F 0.62 | — | — | — | — | — | — | — | — | — | — |
| 3 (int64) | 0 / 0.00 | 3 / <0.01 | 2.31 | 0.84 | 1.00 | 1.00 | 1.87 | 3.00 | 3.00 | 3.00 | 3.00 | 1.00 |
| 4 (string) | 0 / 0.00 | 891 / 1.00 | 27.0 SL | 9.28 SL | 12 SL | — | — | 25 SL | — | — | 82 SL | — |
| 5 (string) | 0 / 0.00 | 2 / <0.01 | 4.70 SL | 0.96 SL | 4 SL | — | — | 4 SL | — | — | 6 SL | — |
| 6 (float64) | 177 / 0.20 | 87 / 0.10 | 29.7 | 14.5 | 0.42 | 4.00 | 20.4 | 28.0 | 38.3 | 56.5 | 80.0 | 17.9 |
| 7 (int64) | 0 / 0.00 | 7 / <0.01 | 0.52 | 1.10 | 0.00 | 0.00 | 0.00 | 0.00 | 1.00 | 2.63 | 8.00 | 1.00 |
| 8 (int64) | 0 / 0.00 | 7 / <0.01 | 0.38 | 0.81 | 0.00 | 0.00 | 0.00 | 0.00 | 0.08 | 2.00 | 6.00 | 0.00 |
| 9 (string) | 0 / 0.00 | 681 / 0.76 | 6.75 SL | 2.75 SL | 3 SL | — | — | 6 SL | — | — | 18 SL | — |
| 10 (float64) | 0 / 0.00 | 248 / 0.28 | 32.2 | 49.7 | 0.00 | 7.20 | 7.91 | 14.5 | 30.9 | 114 | 512 | 23.1 |
| 11 (string) | 687 / 0.77 | 146 / 0.16 | 3.59 SL | 2.07 SL | 1 SL | — | — | 3 SL | — | — | 15 SL | — |
| 12 (string) | 2 / <0.01 | 2 / <0.01 | 1.00 SL | 0.00 SL | 1 SL | — | — | 1 SL | — | — | 1 SL | — |
The column summary table, like the preview, provides the table type and dimensions, as well as column names and types. For each column we also get the missing count/proportion (NA), unique value count/proportion (UQ), mean (Mean), standard deviation (SD), minimum (Min), 5th percentile (P5), first quartile (Q1), median (Med), third quartile (Q3), 95th percentile (P95), maximum (Max), and interquartile range (IQR: Q3 - Q1). For string columns, the statistics are computed on string lengths (SL).
After previewing and summarizing the table, you should have a decent understanding of the dataset and an idea of which data quality checks to implement.
Create a validation:
validation = (
    pb.Validate(
        data=titanic,
        tbl_name="titanic",  # Name of the table for reporting
        label="Example titanic dataset validation",  # Label for the validation, appears in reports
        thresholds=(0.01, 0.02, 0.05),  # Thresholds for warnings, errors, and critical issues
        brief=True,  # Add automatically-generated briefs for each step
    )
    .col_vals_not_null(columns="PassengerId")  # Validate values not Null
    .col_exists(columns=["Name", "Ticket"])  # Validate columns exist
    .col_vals_between(columns="Pclass", left=1, right=3)  # Validate 1 <= values <= 3
    .col_vals_in_set(columns="Sex", set=["male", "female"])  # Validate values "male" or "female"
    .col_vals_lt(columns="Age", value=60, na_pass=True)  # Validate values < 60, Null allowed
    .interrogate()  # Execute and collect results
)
The Validate class’s data parameter takes the database table (expression) to be validated. The chained methods starting with col_ specify validation steps run on specific columns. Finally, the interrogate method executes the validation plan.
Get the validation report from a notebook:
validation
Or from the Python REPL:
validation.get_tabular_report().show()
(Pointblank Validation report. Label: Example titanic dataset validation; table: DuckDB, titanic; thresholds: WARNING 0.01, ERROR 0.02, CRITICAL 0.05.)

| STEP | COLUMNS | VALUES | UNITS | PASS | FAIL | W | E | C | EXT |
|---|---|---|---|---|---|---|---|---|---|
| 1 col_vals_not_null() | PassengerId | — | 891 | 891 / 1.00 | 0 / 0.00 | ○ | ○ | ○ | — |
| 2 col_exists() | Name | — | 1 | 1 / 1.00 | 0 / 0.00 | ○ | ○ | ○ | — |
| 3 col_exists() | Ticket | — | 1 | 1 / 1.00 | 0 / 0.00 | ○ | ○ | ○ | — |
| 4 col_vals_between() | Pclass | [1, 3] | 891 | 891 / 1.00 | 0 / 0.00 | ○ | ○ | ○ | — |
| 5 col_vals_in_set() | Sex | male, female | 891 | 891 / 1.00 | 0 / 0.00 | ○ | ○ | ○ | — |
| 6 col_vals_lt() | Age | 60 | 891 | 865 / 0.97 | 26 / 0.03 | ● | ● | ○ | — |

(Interrogation ran 2025-06-08 14:02:15 UTC, duration < 1 s.)
The header provides the validation label, table type and name, and thresholds. For each step, the validation rules include the assertion type and brief (STEP), column name (COLUMNS), and applicable values (VALUES). The validation results include the test unit count (UNITS), passing count/proportion (PASS), failing count/proportion (FAIL), as well as warning (W), error (E), and critical (C) thresholds. The final extracts (EXT) column is not yet supported for DuckDB.
The validation results report can also be retrieved as a JSON-formatted string or Python dictionary:
import json
json_report = validation.get_json_report()
dict_report = json.loads(json_report)
This is especially useful in data pipelines and scheduled data quality checks. You can log the report, send a Slack message, continue or fail the pipeline run, etc.
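For example, a scheduled job might fail the run whenever any step has failing test units (a minimal sketch; it assumes the all_passed() method on the Validate object, so check the Pointblank API reference):
import sys

# Fail the pipeline if any validation step had failing test units
# (assumes Validate.all_passed(); verify against the Pointblank API reference)
if not validation.all_passed():
    print(json_report)  # log the full report for later inspection
    sys.exit(1)  # a non-zero exit fails most schedulers and CI jobs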
Pointblank has many other great features that I haven’t mentioned here. To learn more, take a look at the Pointblank user guide.
Thanks for reading and don’t forget to validate your data!