The Pain Point
Your data has inconsistent text: “ JOHN SMITH ” vs “john smith” vs “John Smith”. Before processing, you need everything standardized.
Agentic Solution
A Text Cleaner that applies consistent formatting rules.
Key Features:
- Whitespace Cleanup: Remove extra spaces, trim edges.
- Case Normalization: UPPER, lower, Title Case.
- Unicode Normalization: NFC/NFD forms for Vietnamese.
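The three features map onto a handful of standard-library calls. The following is a minimal sketch, assuming plain Python strings; the function name `normalize_text` and its `case` and `form` parameters are illustrative and do not come from any existing tool.

```python
import re
import unicodedata


# Hypothetical helper: the name and parameters are illustrative only.
def normalize_text(text: str, case: str = "title", form: str = "NFC") -> str:
    """Apply Unicode normalization, whitespace cleanup, and a case rule."""
    # Unicode first, so composed and decomposed input converge on one form.
    text = unicodedata.normalize(form, text)
    # Collapse any run of whitespace to a single space, then trim the edges.
    text = re.sub(r"\s+", " ", text).strip()
    if case == "upper":
        return text.upper()
    if case == "lower":
        return text.lower()
    if case == "title":
        return text.title()
    return text  # any other value leaves the case untouched


for raw in (" JOHN SMITH ", "john  smith", "John Smith"):
    print(repr(raw), "->", repr(normalize_text(raw)))
# ' JOHN SMITH ' -> 'John Smith'
# 'john  smith' -> 'John Smith'
# 'John Smith' -> 'John Smith'
```

Note that `str.title()` uses a simple definition of a word, so apostrophes count as word boundaries (`"they're"` becomes `"They'Re"`). Names with apostrophes may need a custom rule.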
Phase 1: Commander (Quick Fix)
Use this for a quick, one-off normalization.
Prompt:
“I have an Excel file `data.xlsx` with text columns. Write a Python script to:
- Trim: Remove leading/trailing whitespace.
- Collapse: Multiple spaces to single space.
- Unicode: Normalize to NFC form.
- Case: Apply Title Case to the “Name” column.
- Output: Save as `data_normalized.xlsx`. Print a sample before/after.”
Result: Clean, consistent text data.
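For orientation, a script answering this prompt might look roughly like the sketch below. It is one possible shape, not the output the prompt is guaranteed to produce. It assumes pandas is installed (plus openpyxl for `.xlsx` files), and the function names and the `text_cols` parameter are hypothetical: the prompt does not say which columns hold text, so the caller lists them.

```python
import pandas as pd


# Hypothetical helper names; the prompt itself does not prescribe them.
def normalize_frame(df: pd.DataFrame, text_cols: list[str],
                    title_col: str = "Name") -> pd.DataFrame:
    """Return a copy of df with the prompt's rules applied to text_cols."""
    out = df.copy()
    for col in text_cols:
        out[col] = (
            out[col]
            .str.strip()                           # Trim
            .str.replace(r"\s+", " ", regex=True)  # Collapse
            .str.normalize("NFC")                  # Unicode
        )
    if title_col in out.columns:
        out[title_col] = out[title_col].str.title()  # Case
    return out


def normalize_workbook(src: str, dst: str, text_cols: list[str]) -> None:
    """Read src, normalize it, print a before/after sample, save to dst."""
    before = pd.read_excel(src)
    after = normalize_frame(before, text_cols)
    print("Before:\n", before[text_cols].head())
    print("After:\n", after[text_cols].head())
    after.to_excel(dst, index=False)
```

A call such as `normalize_workbook("data.xlsx", "data_normalized.xlsx", text_cols=["Name"])` would run the whole pipeline. The `.str` methods pass missing values through unchanged, so empty cells stay empty.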
Phase 2: Architect (Permanent Tool)
Engineering Prompt:
**Role:** Python Tool Developer
**Task:** Create a "Text Normalizer".
**Requirements:**
1. **GUI:**
* Select Excel file.
* Column selector (apply to which columns).
* Rule checkboxes: Trim, Collapse spaces, Case (dropdown).
* Unicode form dropdown (NFC, NFD).
* Preview changes.
2. **Logic:**
* String manipulation with regex.
* unicodedata.normalize().
* Handle None/NaN values.
3. **Deliverables:**
* `text_normalizer.py`
* `run.bat`, `run.sh`
* `requirements.txt`
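The Logic requirements (regex, `unicodedata.normalize()`, None/NaN handling) and the preview feature could be met along these lines. This is a sketch of the non-GUI core only, under the assumption that each checkbox or dropdown maps to one field of a settings object. `Rules`, `apply_rules`, and `preview` are hypothetical names, not what the generated tool will necessarily contain.

```python
import re
import unicodedata
from dataclasses import dataclass
from typing import Any, Optional


# Hypothetical settings object: one field per GUI checkbox or dropdown.
@dataclass
class Rules:
    trim: bool = True
    collapse: bool = True
    case: Optional[str] = None   # "upper", "lower", "title", or None
    form: Optional[str] = "NFC"  # "NFC", "NFD", or None to skip


_CASE = {"upper": str.upper, "lower": str.lower, "title": str.title}


def apply_rules(value: Any, rules: Rules) -> Any:
    """Normalize one cell; non-string cells are returned unchanged."""
    # None, float NaN, numbers, and dates are not str, so they pass through.
    if not isinstance(value, str):
        return value
    if rules.form:
        value = unicodedata.normalize(rules.form, value)
    if rules.collapse:
        value = re.sub(r"\s+", " ", value)
    if rules.trim:
        value = value.strip()
    if rules.case in _CASE:
        value = _CASE[rules.case](value)
    return value


def preview(values, rules: Rules, limit: int = 5):
    """Return up to `limit` (before, after) pairs for cells that would change."""
    changes = []
    for value in values:
        new = apply_rules(value, rules)
        if isinstance(value, str) and new != value:
            changes.append((value, new))
            if len(changes) == limit:
                break
    return changes


cells = ["  JOHN   SMITH ", None, float("nan"), "John Smith", 42]
for before, after in preview(cells, Rules(case="title")):
    print(repr(before), "->", repr(after))
# '  JOHN   SMITH ' -> 'John Smith'
```

Keeping the rules in a plain data object means the GUI only has to build a `Rules` instance and call `preview` before writing anything to disk.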
Prompt Decoding
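- Unicode Normalization: The same Vietnamese character can be stored as one precomposed code point or as a base letter followed by combining marks. The two forms render identically but do not compare equal, so pick one form before matching or deduplicating. NFC (the composed form) is generally preferred for web content.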
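To make the composed-versus-decomposed difference concrete, here is a small illustration using only the standard library. The sample name “Nguyễn” is chosen here for demonstration and does not come from any dataset in this guide.

```python
import unicodedata

composed = "Nguy\u1ec5n"           # ễ stored as one code point (U+1EC5)
decomposed = "Nguye\u0302\u0303n"  # e + combining circumflex + combining tilde

# Both render as the same name, but they are different strings.
print(composed == decomposed)          # False
print(len(composed), len(decomposed))  # 6 8

# Normalizing to a single form makes them comparable.
nfc = unicodedata.normalize("NFC", decomposed)
nfd = unicodedata.normalize("NFD", composed)
print(nfc == composed)    # True
print(nfd == decomposed)  # True
```

Without this step, a lookup or a duplicate check can miss rows that look identical on screen.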
Instructions
- Copy Prompt → Run.