Data Cleaning

Upload qualitative data files for cleaning. The tool removes timestamps, filler words, and personally identifying information (PII): pattern matching catches structured identifiers, then an AI review catches names, locations, and contextual identifiers.

.docx, .txt, .csv, .pdf — up to 20 MB each, 30 files max
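
To check a batch against these limits before uploading, a minimal Python sketch follows. The function name `check_batch` and the reading of "20 MB" as 20 × 1024 × 1024 bytes are assumptions for illustration; the tool's own validation may differ.

```python
from pathlib import Path

ALLOWED_EXTENSIONS = {".docx", ".txt", ".csv", ".pdf"}
MAX_FILE_BYTES = 20 * 1024 * 1024  # assumption: "20 MB" read as 20 MiB
MAX_FILES = 30


def check_batch(files):
    """Return a list of problems for a batch of (name, size_in_bytes) pairs.

    Hypothetical helper: an empty list means the batch fits the stated limits.
    """
    problems = []
    if len(files) > MAX_FILES:
        problems.append(f"too many files: {len(files)} > {MAX_FILES}")
    for name, size in files:
        ext = Path(name).suffix.lower()  # case-insensitive extension check
        if ext not in ALLOWED_EXTENSIONS:
            problems.append(f"{name}: unsupported type {ext or '(none)'}")
        if size > MAX_FILE_BYTES:
            problems.append(f"{name}: larger than 20 MB")
    return problems
```

For example, `check_batch([("interview1.docx", 1_200_000)])` returns an empty list, while an `.xlsx` file would be reported as an unsupported type.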

Options

Remove timestamps from transcripts
Remove filler words (um, uh, hmm)
AI de-identification (recommended)
Anonymize organization names

AI de-identification uses a language model to catch names, locations, job titles, and identifying combinations that pattern-matching cannot detect. Text is processed on EU servers with zero data retention. Uncheck "Anonymize organization names" if your analysis needs specific organization and program names.
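
The two non-AI options, timestamp removal and filler-word removal, can be approximated with regular expressions. The sketch below is an illustration only: the timestamp formats (`[00:12:34]`, `00:12:34`, `12:34`), the filler list, and the function name `clean_line` are assumptions, not the tool's actual rules.

```python
import re

# Assumed timestamp shapes: [00:12:34], 00:12:34, 12:34, optional .mmm fraction
TIMESTAMP = re.compile(r"\[?\b\d{1,2}:\d{2}(?::\d{2})?(?:[.,]\d{1,3})?\b\]?")

# Fillers named on this page (um, uh, hmm), with stretched variants like "ummm"
FILLER = re.compile(r"\b(?:um+|uh+|hm+)\b[,.]?\s*", re.IGNORECASE)


def clean_line(text):
    """Strip timestamps and filler words, then tidy leftover whitespace."""
    text = TIMESTAMP.sub("", text)
    text = FILLER.sub("", text)
    return re.sub(r"\s{2,}", " ", text).strip()


print(clean_line("[00:01:15] Interviewer: Um, so uh tell me more."))
# prints: Interviewer: so tell me more.
```

A pattern this simple also removes clock times that a participant says aloud (for example "3:30"), which is one more reason to review the cleaned output.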

Before you start

.docx — Word documents (text is extracted automatically)
.txt — Plain text files
.csv — CSV files (you'll choose which columns to clean)
.pdf — PDF documents (text is extracted automatically)

Not supported: Excel (.xlsx) — export as CSV first (File → Save As → CSV). Scanned or image-based PDFs cannot be processed — use a text-based PDF or convert to .docx first.

Time estimate: Pattern scanning is near-instant. AI de-identification adds 1–2 minutes per file.

How your data is handled

Step 1 — Pattern scan: Regex catches phone numbers, emails, postal codes, SINs, URLs, social media handles, and long ID numbers.
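
As a rough illustration of what a pattern scan like Step 1 involves, the Python sketch below replaces each structured identifier with a labeled placeholder. Every pattern here is an assumption written for this example (including the Canadian-style postal code and nine-digit SIN formats and the eight-digit threshold for "long" IDs); the tool's real patterns are not published on this page.

```python
import re

# Order matters: more specific patterns run before broader ones.
PATTERNS = [
    ("URL", re.compile(r"https?://\S+|www\.\S+", re.IGNORECASE)),
    ("EMAIL", re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b")),
    # Assumed North American phone format, optional +1 prefix
    ("PHONE", re.compile(
        r"(?<!\d)(?:\+?1[\s.-]?)?\(?\d{3}\)?[\s.-]?\d{3}[\s.-]?\d{4}(?!\d)")),
    # Assumed SIN format: nine digits, optionally grouped 3-3-3
    ("SIN", re.compile(r"(?<!\d)\d{3}[\s-]?\d{3}[\s-]?\d{3}(?!\d)")),
    # Assumed Canadian postal code format: A1A 1A1
    ("POSTAL_CODE", re.compile(r"\b[A-Za-z]\d[A-Za-z][ -]?\d[A-Za-z]\d\b")),
    ("HANDLE", re.compile(r"(?<!\w)@\w{2,}")),
    # Assumed threshold: any remaining run of eight or more digits
    ("ID", re.compile(r"(?<!\d)\d{8,}(?!\d)")),
]


def scrub(text):
    """Replace each match with a bracketed label such as [EMAIL]."""
    for label, pattern in PATTERNS:
        text = pattern.sub(f"[{label}]", text)
    return text


print(scrub("Email jane.doe@example.org or call 613-555-0142."))
# prints: Email [EMAIL] or call [PHONE].
```

Labels can overlap in a sketch like this (a ten-digit ID would be tagged as a phone number), but the value is still redacted. Names, places, and job titles have no fixed shape, which is why they are left to the AI review in Step 2.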

Step 2 — AI review: A language model reviews the text for names, initials, locations, organization names, job titles, and identifying combinations of details. Text is processed on EU servers with zero data retention.

Always review cleaned files before sharing. Automated de-identification is a strong first pass, not a replacement for human review.