July 31, 2023 1 min read

File Processing with AI: Automating Data Workflows

AI File Processing Data Engineering Automation

AI assistants can help process various file types, from data conversion to content extraction. Today we’ll explore file processing patterns using AI tools.

File Processing Capabilities

# What AI can do with files
capabilities = {
    "data_files": {
        "csv": "Read, analyze, transform, export",
        "excel": "Multi-sheet processing, formulas extraction",
        "json": "Parse, flatten, restructure",
        "parquet": "Read and convert",
        "xml": "Parse and extract data"
    },
    "documents": {
        "pdf": "Extract text and tables",
        "text": "Parse, clean, structure",
        "markdown": "Generate and format"
    },
    "images": {
        "charts": "Digitize data from chart images",
        "screenshots": "Extract information",
        "diagrams": "Describe and analyze"
    }
}

Pattern 1: Format Conversion

CSV to JSON

**Prompt:**
"I'm uploading a CSV file with customer data.
Please:
1. Convert it to JSON format
2. Use camelCase for keys (customer_id -> customerId)
3. Group records by region into nested structure
4. Validate that all required fields are present
5. Provide the JSON file for download"

Excel to CSV

**Prompt:**
"Process this Excel file:
- Sheet 'Q1 Sales' contains data
- Skip the first 3 rows (headers start at row 4)
- Columns A-F contain the relevant data
- Convert date columns to YYYY-MM-DD format
- Replace empty cells with 'N/A'
- Export as UTF-8 CSV"

Flattening Nested JSON

**Prompt:**
"I have a JSON file with nested structure:
{
  'orders': [
    {
      'id': 1,
      'customer': {'name': '...', 'address': {...}},
      'items': [{'product': '...', 'qty': 1}, ...]
    }
  ]
}

Please:
1. Flatten to tabular format
2. One row per order-item combination
3. Use dot notation for nested fields (customer.name)
4. Export as CSV"

Pattern 2: Data Cleaning Automation

Standardization

**Prompt:**
"Clean this customer data file:

1. Names:
   - Convert to Title Case
   - Remove extra whitespace
   - Handle 'Mr.', 'Mrs.', etc. prefixes

2. Phone Numbers:
   - Standardize to format: (XXX) XXX-XXXX
   - Mark invalid numbers as 'INVALID'

3. Addresses:
   - Expand abbreviations (St -> Street, Ave -> Avenue)
   - Standardize state codes

4. Emails:
   - Convert to lowercase
   - Validate format
   - Flag potentially invalid emails

Provide a cleaned file and a summary of changes made."

Deduplication

**Prompt:**
"Find and handle duplicates in this customer file:

1. Identify exact duplicates (all fields match)
2. Identify potential duplicates:
   - Same email
   - Same name + phone
   - Similar names (fuzzy match)

3. For each duplicate set:
   - Keep the most recent record
   - Merge non-empty fields where possible

4. Provide:
   - Deduplicated file
   - List of removed duplicates with reason
   - Confidence score for fuzzy matches"

Pattern 3: Data Extraction

Extract Tables from PDF

**Prompt:**
"I'm uploading a PDF report.
Please:
1. Identify all tables in the document
2. Extract each table to a separate CSV
3. Preserve the header rows
4. Handle merged cells appropriately
5. Number the outputs (table_1.csv, table_2.csv, etc.)
6. Provide a summary of what was extracted from where"

Extract Data from Screenshots

**Prompt:**
"I'm uploading a screenshot of a data table from a web page.
Please:
1. Extract all visible data into structured format
2. Identify column headers
3. Handle any partially visible rows appropriately
4. Provide as CSV file
5. Note any extraction uncertainties"

Parse Semi-Structured Text

**Prompt:**
"I have a text file with product information in this format:
---
Product: Widget A
SKU: WID-001
Price: $29.99
In Stock: Yes
Description: A high-quality widget...
---
Product: Gadget B
...

Please:
1. Parse all products
2. Extract fields: product name, sku, price (as number), in_stock (as boolean), description
3. Convert to structured JSON
4. Flag any records with missing required fields"

Pattern 4: File Merging and Splitting

Merge Multiple Files

**Prompt:**
"I'm uploading 3 CSV files (sales_jan.csv, sales_feb.csv, sales_mar.csv).
Please:
1. Verify all have the same columns
2. Concatenate into a single file
3. Add a 'source_file' column to track origin
4. Remove any duplicate rows
5. Sort by date
6. Provide the combined file"

Split Large File

**Prompt:**
"I have a large CSV file that needs to be split:
1. Split by region (create one file per unique region)
2. Name files as 'sales_{region}.csv'
3. Keep headers in each output file
4. Provide a summary showing record counts per file"

Pattern 5: Data Validation

Schema Validation

**Prompt:**
"Validate this data file against these rules:

Required columns: id, name, email, created_date, amount
Data types:
  - id: positive integer
  - name: non-empty string
  - email: valid email format
  - created_date: date in YYYY-MM-DD format
  - amount: positive decimal

Business rules:
  - created_date cannot be in the future
  - amount must be between 0 and 10000
  - no duplicate ids

Provide:
1. List of validation errors with row numbers
2. Summary statistics (total rows, valid rows, error rate)
3. A cleaned file with only valid rows
4. A file with rejected rows and error reasons"

Pattern 6: Report Generation

Generate Report from Data

**Prompt:**
"Based on the uploaded sales data, generate:

1. Executive Summary Report (Word/Markdown):
   - Key metrics (total sales, growth, top products)
   - 3-5 key insights
   - Recommendations

2. Detailed Data Appendix (Excel):
   - Sheet 1: Monthly summary
   - Sheet 2: Category breakdown
   - Sheet 3: Top 20 products
   - Include appropriate formulas and formatting

3. Presentation Slides (Key points only):
   - Slide 1: Overview metrics
   - Slide 2: Trend analysis
   - Slide 3: Category performance
   - Slide 4: Recommendations

Provide all files for download."

Workflow Automation

Creating Processing Pipelines

**Prompt:**
"Create a data processing pipeline that:

1. Input: CSV file uploaded weekly (sales_YYYYMMDD.csv)

2. Validation:
   - Check expected columns exist
   - Validate data types
   - Flag quality issues

3. Transformation:
   - Calculate derived fields (total, profit margin)
   - Add week number and month columns
   - Categorize amounts (Low/Medium/High)

4. Output:
   - Processed file (processed_YYYYMMDD.csv)
   - Validation report (validation_YYYYMMDD.txt)
   - Summary statistics (summary_YYYYMMDD.json)

Please process the uploaded file and provide all outputs.
Also provide the Python code so I can automate this weekly."

Best Practices

best_practices = {
    "validation": "Always validate outputs against expectations",
    "backup": "Keep original files before transformation",
    "documentation": "Request processing steps be documented",
    "incremental": "For large files, process in chunks",
    "error_handling": "Ask for graceful handling of edge cases",
    "reproducibility": "Request code for repeatable processes"
}

This concludes our July series on Microsoft Fabric and AI topics. Tomorrow we begin August with LLM fine-tuning strategies.