Ressources
Retour

Vous avez économisé des centaines d'heures de processus manuels lors de la prévision de l'audience d'un jeu à l'aide du moteur de flux de données automatisé de Domo.

Regardez la vidéo
À propos
Retour
Récompenses
Recognized as a Leader for
31 consecutive quarters
Leader du printemps 2025 en matière de BI intégrée, de plateformes d'analyse, de veille économique et d'outils ELT
Tarifs

Data Cleaning in SQL: Examples & Techniques

3
min read
Thursday, October 23, 2025
Data Cleaning in SQL: Examples & Techniques

Raw data is rarely analysis-ready. Missing values, inconsistent formats, and duplicated records can derail even the best-designed dashboards or machine learning models. Research shows that analysts and data scientists spend up to 80 percent of their time preparing and cleaning data instead of analyzing it. While dedicated data preparation tools are out there, SQL remains one of the most powerful and widely used options for cleaning, transforming, and standardizing data.

In this article, we’ll show you:

  • What data cleaning in SQL means and why it’s essential.
  • Common data quality issues SQL can solve.
  • Key SQL techniques with practical code examples.
  • Tips for structuring repeatable, scalable cleaning workflows.
  • Challenges to anticipate and how to overcome them.
  • Why pairing SQL with modern BI platforms like Domo accelerates results.

Whether you’re a business analyst wrangling a single table or a data engineer maintaining enterprise pipelines, mastering SQL data cleaning techniques can dramatically improve data accuracy, speed, and trust.

Understanding data cleaning in SQL

Data cleaning (also called data cleansing or data scrubbing) is the process of detecting and fixing errors, inconsistencies, and inaccuracies in your data sets so you can use them for reliable analysis

For decades, SQL (Structured Query Language) has served as the backbone of relational databases for decades. It’s fast, expressive, and nearly universal, making it a natural choice for transforming messy raw data into structured, analytics-ready tables.

In contrast to manual spreadsheet fixes or ad hoc scripts, SQL queries can be written once then easily reused, integrated into ETL pipelines, and run at scale across millions or even billions of rows.

Why SQL remains a go-to for data cleaning

  • Universality: Works across data warehouses such as Snowflake, BigQuery, Redshift, and on-premises systems.
  • Performance: Optimized for large set-based operations.
  • Maintainability: Queries can be stored, versioned, and scheduled.
  • Integration: Fits easily into BI and analytics pipelines.

Common data quality issues SQL can fix

Data Issue Symptoms SQL-Based Fixes
Duplicate rows Inflated counts, incorrect metrics ROW_NUMBER(), DISTINCT, GROUP BY
Inconsistent formatting Mixed date formats, casing differences, irregular spacing CAST(), TRIM(), UPPER(), LOWER()
Missing or null values Broken calculations, misleading averages COALESCE(), conditional imputation
Outliers and anomalies Skewed means, unrealistic data points CASE WHEN, statistical filtering
Incorrect joins and orphan keys Incomplete data views, duplicates INNER JOIN, LEFT JOIN validation
Data type mismatches Query errors, incorrect sorting CAST(), CONVERT()
Inconsistent categorical values “US” vs “U.S.” vs “United States” CASE mapping tables
Trailing spaces and hidden characters Failing joins or grouping TRIM(), REPLACE()

Core SQL techniques and examples

Remove duplicates

Duplicate records can distort metrics, inflate totals, and confuse downstream analytics. SQL window functions make it easy to identify and keep the most relevant version of each record, such as the latest update, while removing unnecessary copies for cleaner, more reliable reporting. For example: 

WITH ranked AS (  
  SELECT *,    
    ROW_NUMBER() OVER (      
      PARTITION BY customer_id, order_date      
      ORDER BY updated_at DESC    
    ) AS row_num  
  FROM orders)
SELECT *
FROM ranked
WHERE row_num = 1;

Or simply:

SELECT DISTINCT customer_id, order_date, total
FROM orders;

Standardize text and casing

Inconsistent text formatting breaks grouping, joins, and reporting logic. Standardizing casing, trimming spaces, and correcting variations ensures your data matches cleanly, avoids duplication, and supports accurate analytics—especially in fields like country names, categories, or customer names.

SELECT  
  UPPER(TRIM(country)) AS country_clean,
  INITCAP(first_name) AS first_name_clean
FROM customers

Replace values:

UPDATE customers
SET country = REPLACE(country, 'U.S.A.', 'USA');

Handle missing or null values

Missing values can cause broken aggregations, misleading averages, or inaccurate dashboards. SQL lets you fill in defaults, apply business logic, or handle nulls gracefully to keep calculations consistent and reports trustworthy.

SELECT
  COALESCE(phone, 'Unknown') AS phone_filled,
  CASE WHEN total_spent IS NULL THEN 0 ELSE total_spent END AS total_spent_clean
FROM customers;

Convert data types

Incorrect or inconsistent data types, like dates stored as text or numbers saved as strings, can break filters and slow queries. Casting ensures your columns use the correct data types for accurate calculations, sorting, and time-based analysis.

SELECT
  CAST(order_date AS DATE) AS order_date_clean,
  CAST(sales_amount AS DECIMAL(10,2)) AS sales_clean
FROM orders;

Normalize dates and times

Date fields often come in inconsistent formats or time zones, making trend analysis challenging. SQL offers tools to standardize time zones, truncate to a desired granularity, and extract components like year or month for reliable time-series insights.

SELECT
  DATE_TRUNC('month', order_date) AS order_month,
  EXTRACT(YEAR FROM order_date) AS order_year
FROM orders;

Identify and remove outliers

Extreme or erroneous data points can skew averages, forecasts, and models. SQL allows you to flag or filter outliers using statistical rules such as standard deviations or percentiles, keeping analysis focused on meaningful, representative data.

WITH stats AS (
  SELECT AVG(sales_amount) AS mean, STDDEV(sales_amount) AS std_dev
  FROM sales
)
SELECT s.*
FROM sales s
CROSS JOIN stats
WHERE s.sales_amount BETWEEN (mean - 3 * std_dev) AND (mean + 3 * std_dev);

Validate relationships between tables

Broken joins, such as orders without matching customers, signal integrity issues and lead to incomplete or incorrect results. Simple SQL checks help you spot missing foreign keys or orphaned records so you can fix upstream data issues.

SELECT o.order_id
FROM orders o
LEFT JOIN customers c
  ON o.customer_id = c.customer_id
WHERE c.customer_id IS NULL;

Recode or map categories

Inconsistent categories (e.g., “US” vs “U.S.” vs “USA”) make grouping unreliable. SQL lets you standardize values with case logic or mapping tables, ensuring consistency and improving the accuracy of reports and downstream models.

SELECT
  CASE
    WHEN country IN ('US', 'USA', 'U.S.') THEN 'USA'
    WHEN country IN ('UK', 'United Kingdom', 'GB') THEN 'UK'
    ELSE country
  END AS country_clean
FROM customers;

Combine cleaning steps with CTEs

Complex cleaning tasks often involve multiple transformations. Common table expressions (CTEs) help you break these into readable, modular steps, making the process easier to debug, maintain, and scale without building one massive query.

WITH trimmed AS (
  SELECT customer_id, TRIM(email) AS email_clean
  FROM customers
),
valid AS (
  SELECT * FROM trimmed WHERE email_clean LIKE '%@%'
)
SELECT DISTINCT email_clean
FROM valid;

Best practices for scalable SQL data cleaning

Profile your data first

Before cleaning, understand what you’re working with. Use counts, distinct values, and simple aggregates to reveal missing fields, unexpected outliers, or duplicate records. Profiling gives you a data health snapshot so you can target fixes efficiently instead of guessing where problems exist.

SELECT COUNT(*) AS total_rows,
       SUM(CASE WHEN email IS NULL THEN 1 ELSE 0 END) AS missing_email
FROM customers;

Use CTEs and modular queries

Complex cleaning can quickly become unmanageable in a single SQL statement. Breaking transformations into common table expressions (CTEs) makes each step clear, testable, and easy to maintain. Modular queries also help teammates understand your logic and allow future edits without rewriting everything.

Treat cleaning scripts like code

Your SQL isn’t just a one-off; it’s part of a production pipeline. Version-control scripts in Git, document assumptions, and comment transformations so others know why each change exists. Treating queries like code builds transparency, prevents accidental regressions, and creates a reliable data foundation.

Automate and schedule

Manual execution doesn’t scale. Schedule SQL cleaning jobs using orchestration tools like Airflow, dbt, or your database’s built-in schedulers. Automation ensures consistent, repeatable data hygiene, reduces human error, and keeps dashboards and analytics current without requiring someone to remember to run scripts.

Monitor for schema drift

Data sources evolve: columns get renamed, data types shift, and new fields appear. Schema drift can silently break your cleaning workflows. Set up automated checks or alerts to detect changes early, helping you adapt queries and maintain clean, reliable pipelines over time.

Keep business experts in the loop

Not every “bad” value is wrong. Some outliers or unusual formats are valid in context. Collaborate with domain experts in finance or operations to confirm cleaning rules. Their input ensures your transformations preserve meaning while improving consistency and accuracy.

Challenges and how to overcome them

Even with strong SQL skills and thoughtful workflows, data cleaning can be harder than it looks. Large data sets, evolving schemas, and ambiguous business rules often introduce unexpected complexity. Anticipating these common challenges will help you design cleaning processes that stay reliable and scalable over time.

  • Hidden data quality problems. Some issues, like unexpected nulls, inconsistent foreign keys, or silent type changes, may go unnoticed until reports break. Profile your data early and use anomaly detection queries to catch issues before they cause downstream errors.
  • Accidentally dropping valid outliers. Not every unusual value is wrong. Record-breaking sales or rare customer scenarios may be legitimate. Confirm with subject matter experts (SMEs) before filtering outliers.
  • Maintenance overhead. SQL scripts can become fragile as they grow. Modularize with CTEs, document logic, and version-control queries to keep pipelines maintainable.
  • Scaling limits. Large data sets can strain performance. Use indexes, partitions, clustering, and warehouse optimizations to keep queries fast and efficient.
  • Complex business logic. Some rules exceed SQL’s capabilities. Pair SQL with Python, dbt, or ETL tools for more advanced transformations and flexible workflows.

By planning for these pitfalls and building maintainable pipelines, you can reduce surprises, preserve data accuracy, and scale your cleaning strategy as data complexity grows.

When to go beyond SQL alone

SQL is a powerful foundation for data cleaning, especially when you’re looking for deterministic, rule-based transformations that can be reused and audited. However, as data ecosystems grow in complexity, relying solely on hand-written SQL scripts can become a bottleneck. Modern data environments are more dynamic than ever, with constantly changing schemas, new data sources, and business users who need access without deep coding knowledge. At some point, scaling your cleaning strategy requires tools and capabilities that go beyond SQL alone.

  • Machine learning-driven anomaly detection and imputation. Traditional SQL rules can flag known issues but struggle with subtle or emerging data errors. Machine learning models can detect patterns, spot unexpected anomalies, and intelligently fill in missing values based on context rather than static defaults.
  • Real-time adaptability to new sources or streaming data. SQL workflows often run in batches. When you handle streaming IoT data, rapid SaaS feed changes, or constantly evolving partner integrations, platforms with event-driven pipelines and dynamic schema handling provide the agility manual SQL can’t match.
  • Visual, no-code workflows for non-technical users. Many organizations want analysts and business stakeholders to participate in data prep. Drag-and-drop interfaces with SQL under the hood let non-technical users contribute to cleaning and validation without writing code.
  • Built-in governance, lineage, and auditability. SQL scripts can be tracked in version control, but enterprise-grade governance—including end-to-end lineage, approval workflows, and compliance-friendly audit logs—usually requires a dedicated platform.

By adding these advanced capabilities, organizations can scale beyond static scripts, reduce manual maintenance, and support a wider range of data consumers. Modern data preparation platforms complement SQL by automating routine tasks, enabling collaboration, and providing the transparency and control for enterprise-grade analytics.

Why Domo accelerates SQL-driven cleaning

Domo lets you keep the flexibility of SQL while adding AI-powered automation, governance, and scalability:

  • AI-driven quality checks: Spot anomalies and inconsistencies faster, using machine learning to detect errors traditional SQL rules might miss.
  • Visual, reusable workflows: Build drag-and-drop preparation steps or embed custom SQL directly for advanced control.
  • Real-time schema adaptation: Automatically adjusts when fields or source structures change, reducing pipeline breaks.
  • End-to-end governance: Full lineage tracking, version control, and audit trails help maintain compliance and trust.
  • Integrated analytics: Clean, transform, and analyze in one environment—without moving data between tools.

Beyond simplifying routine cleaning tasks, Domo empowers both technical teams and business users to collaborate effortlessly. You can combine SQL-driven logic with low-code workflows, automate repetitive processes, and scale data preparation without adding headcount. Its cloud-native architecture ensures performance as data volume grows, while built-in transparency keeps stakeholders confident in the results.

Instead of juggling scripts, schedulers, and separate BI tools, Domo provides a single, unified platform where teams can profile, clean, transform, and visualize data at enterprise scale.

Ready to move from manual SQL clean-up to intelligent, automated data preparation? Contact Domo to see how our platform can help you clean and transform data faster, reduce maintenance, and unlock reliable insights.

Author

Read more about the author
No items found.
No items found.
Explore all

Domo transforms the way these companies manage business.

No items found.
Data Quality