Risorse
Indietro

Hai risparmiato centinaia di ore di processi manuali per la previsione del numero di visualizzazioni del gioco utilizzando il motore di flusso di dati automatizzato di Domo.

Guarda il video
A packed indoor basketball arena with a large scoreboard hanging above the court showing game information.
Chi siamo
Indietro
Premi
Recognized as a Leader for
32 consecutive quarters
Primavera 2025, leader nella BI integrata, nelle piattaforme di analisi, nella business intelligence e negli strumenti ELT
Prezzi

Data Cleaning With Automation: How One BI Director Eliminated Manual Mapping Work

Grant Stowell

Field & Partner Marketing Specialist

4
0
min read
Monday, May 18, 2026
Cleaning data with automation: Chaos cleanup—Magic ETL AI/Py

Excel lookup tables have a breaking point, and if you're managing data from multiple sources, you've probably hit it.

Mark Snodgrass, director of business intelligence at NLC Mutual, faced exactly this challenge while standardizing claims data across more than 20 separate instances. In the Domopalooza 2026 breakout session Cleaning up Chaos with Magic ETL, AI, and Python, he shared how cleaning data with automation transformed his workflow. Instead of maintaining brittle spreadsheets that broke every time a new state or code appeared, he embedded AI-assisted Python directly into his ETL pipeline. The result? Days of manual work eliminated each month, plus more accurate categorization than his old approach ever delivered.

Whether you're wrangling insurance claims, customer records, or any other messy categorical data, the patterns Mark shared apply broadly. Here's what stood out.

Replace manual lookups with AI-assisted categorization

Brittle, manual lookup tables can be replaced with AI-assisted categorization embedded directly in your ETL pipeline. Mark described starting with an Excel-based lookup table to standardize claim cause codes across states, but it became unreliable and difficult to maintain as new states and new codes appeared. He then used AI Chat to help build and refine a Python script placed inside Magic ETL so categorization could be automated and maintained in-pipeline rather than externally.

"Initially, I was using an Excel file to create kind of a lookup table of, hey, here's different cost codes that were coming in from these states. Here's my standardized name that I want to use for it," Mark explained. "So thankfully, AI came along to help me standardize those categories."

The shift wasn't just about convenience. "I ended up using AI Chat to develop a Python script, put that in the Magic ETL," he said. "And so that took just days off of my day, or my month."

Here's how you can apply this pattern to your own data cleaning challenges:

  1. Start with a specific categorization problem: Pick one field or set of fields where inconsistent values cause downstream issues.
  2. Use AI Chat tools to generate initial Python logic: You don't need to write the script from scratch. Describe your problem and let AI suggest an approach.
  3. Embed the script in your ETL pipeline: Running categorization inside your transformation workflow keeps everything in one place and eliminates external file dependencies.
  4. Iterate until confident: Refine the script by reviewing outputs, adjusting logic, and re-running until the categories align with your expectations.

The key insight here is here is moving categorization logic from fragile external files into your pipeline where it can scale and adapt.

Improve accuracy with multiple fields and rule-based overrides

Categorization accuracy improves dramatically when you use multiple fields, weight important fields, and combine ML-style text processing with rule-based overrides. Mark emphasized that his categorization improved when he moved from using a single field to using both "coverage subtype" and "loss cause." He also described weighting one field by listing it twice, using vectorization to emphasize meaningful terms, and adding explicit rule-based overrides (e.g., hurricane + flood) to catch known edge cases.

This hybrid approach—part machine learning, part explicit business logic—creates a more accurate automated cleaning system than either method alone.

Consider these techniques when building your own categorization logic:

  • Concatenate multiple fields: Combining two or three relevant columns gives the algorithm more context to work with.
  • Weight important fields: If one field is more reliable or meaningful, include it multiple times in your concatenated string.
  • Use vectorization to emphasize unique terms: Text processing techniques like TF-IDF reduce noise from common words (like "the" or "and") while highlighting distinctive terms.
  • Add rule-based overrides for known edge cases: When you know that certain combinations should always map to specific categories, encode that logic explicitly rather than hoping the algorithm figures it out.

The beauty of this approach is that it handles the 90 percent of cases that follow patterns while giving you explicit control over the tricky 10 percent.

Normalize join keys to prevent matching failures

Joins become more reliable during automated cleaning when you normalize multi-column keys into a single standardized join string. Mark concatenated three join columns (line of business, coverage subtype, and loss cause), trimmed them, and lowercased them on both sides. This simplified the join and made it "more sure" even when blanks or inconsistent formatting might otherwise break matching.

Join tiles can be surprisingly sensitive to case differences, extra spaces, and blank values. When you're joining the output of an automated categorization process back to your main dataset, these small inconsistencies can cause records to drop or duplicate.

Here's the practical implementation pattern:

  • Concatenate all join columns into a single string: This reduces the number of join conditions and simplifies troubleshooting.
  • Trim whitespace: Leading and trailing spaces are invisible but cause match failures.
  • Convert to lowercase: Case sensitivity catches people off guard, especially with data from multiple sources.
  • Apply the same transformation on both sides: The source dataset and the lookup dataset need identical normalization logic.

This technique works in any ETL tool, not just the one Mark used. The principle is universal: reduce join complexity and eliminate formatting variations before they cause problems.

Put automated data cleaning into practice

Mark's approach demonstrates that cleaning data with automation doesn't require a massive infrastructure investment. You can start with a single problematic field, use AI to help generate initial logic, and iterate until the output meets your standards. The time savings can compound quickly. You'll find that what once took days of manual spreadsheet maintenance now runs automatically as part of your regular data refresh.

The patterns here transfer across tools and industries. Whether you're standardizing product categories, normalizing customer segments, or aligning regional codes, the core principles remain: embed logic in your pipeline, use multiple fields for context, add explicit overrides for edge cases, and normalize your join keys.

Ready to see the full walkthrough, including the specific Python scripts and workflow configurations? Watch the complete session from Domopalooza 2026 to get all the implementation details.

No items found.
Table of contents
Carrot arrow icon
Tags
No items found.
No items found.
Explore all
No items found.
No items found.