iPAS Study Notes - AI Application Planner
TLDR
- AI Fundamentals: AI includes Machine Learning (ML) and Deep Learning (DL). Currently, the mainstream is "Narrow AI," while Artificial General Intelligence (AGI) has not yet been achieved.
- Data Engineering: Data Lakehouse combines the flexibility of a data lake with the governance capabilities of a data warehouse, serving as the foundation for modern AI projects.
- Data Processing: The Medallion Architecture (Bronze/Silver/Gold) is a widely used pattern for layered data management; ELT has gradually replaced traditional ETL on cloud platforms, where cheap, scalable compute makes it practical to transform data after loading.
- Feature Engineering: Categorical features require encoding based on cardinality and order (One-Hot, Target, WoE, etc.); numerical features require normalization (Z-score, Robust Scaling, etc.) to eliminate scale differences.
- Model Evaluation: When classes are imbalanced, Accuracy becomes misleading; priority should be given to F1-score, AUC-ROC, or MCC. For regression problems, choose MAE or RMSE based on sensitivity to outliers.
- Model Training: Overfitting can be mitigated through Dropout, regularization (L1/L2), and Early Stopping; cross-validation helps detect it by giving a more reliable estimate of generalization error.
- Generative AI: RAG is suitable for knowledge base Q&A, while Fine-tuning is suitable for reinforcing specific styles or domain terminology; both can be used in combination.
- AI Governance: The EU AI Act adopts risk-based management; AI systems must focus on Bias, Robustness, and Explainability (XAI).
AI Fundamental Concepts
AI Capability Levels and Classification
Current commercial AI (such as the GPT series) belongs to "Narrow AI," characterized by a lack of autonomous goal setting, no persistent memory (unless supplemented by external mechanisms such as RAG), and limited cross-domain transferability. AI functions can be categorized into four types: analytical, predictive, generative, and prescriptive.
Relationship between AI, Machine Learning, and Deep Learning
The three have a nested relationship (AI ⊃ ML ⊃ DL):
- AI: Refers generally to technologies that enable machines to exhibit intelligent behavior.
- ML: Learns patterns automatically from data; classical ML methods typically rely on manually designed features.
- DL: Uses multi-layer neural networks to automatically extract features, requiring large amounts of data.
Data Engineering
Data Storage Architecture
- Data Warehouse: Suitable for structured data; uses Schema-on-Write.
- Data Lake: Suitable for raw data storage; uses Schema-on-Read.
- Data Lakehouse: Combines the advantages of both, supporting ACID transactions and version control, suitable for reporting, ML, and RAG integration.
Data Processing: ETL vs ELT
Modern cloud platforms mostly adopt ELT, loading raw data into the lakehouse (Bronze layer) first and then transforming it as needed. This approach preserves raw details, which is beneficial for subsequent feature extraction by machine learning models.
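The load-then-transform flow can be illustrated with a minimal sketch. The record fields, the function names, and the choice of "total spend per user" as the Gold table are all hypothetical; a real lakehouse would use a table format and a query engine rather than Python lists.

```python
# Hypothetical raw events as they might arrive from a source system.
raw_events = [
    {"user_id": "1", "amount": "10.5", "country": "TW"},
    {"user_id": "2", "amount": "bad", "country": "tw"},
    {"user_id": "1", "amount": "4.5", "country": "TW"},
]

def load_bronze(events):
    """Load step: store records unchanged so raw detail is preserved."""
    return [dict(e) for e in events]

def to_silver(bronze):
    """First transform: cast types, normalize values, skip unparseable rows."""
    silver = []
    for row in bronze:
        try:
            amount = float(row["amount"])
        except ValueError:
            continue  # the bad row is still kept in Bronze for inspection
        silver.append({
            "user_id": int(row["user_id"]),
            "amount": amount,
            "country": row["country"].upper(),
        })
    return silver

def to_gold(silver):
    """Second transform: aggregate into a business-level table."""
    totals = {}
    for row in silver:
        totals[row["user_id"]] = totals.get(row["user_id"], 0.0) + row["amount"]
    return totals

bronze = load_bronze(raw_events)
silver = to_silver(bronze)
gold = to_gold(silver)
print(len(bronze), len(silver), gold)
```

Because Bronze keeps the malformed row, a later feature pipeline can revisit it with different parsing rules without re-ingesting from the source.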
Data Governance: Data Mesh
Data Mesh decentralizes data ownership to individual business domains, solving the scaling bottlenecks of centralized platforms through "Data as a Product" and "Federated Governance."
Feature Engineering
Categorical Feature Encoding
- One-Hot Encoding: Suitable for features with few categories and no inherent order.
- Ordinal Encoding: Suitable for categories with a clear order (e.g., education level).
- Target Encoding: Suitable for high-cardinality features, but must guard against Data Leakage: compute the encoding from training data only, and use Leave-One-Out or Smoothing to limit the influence of each row's own label.
- WoE Encoding: Designed specifically for binary classification and fits naturally with Logistic Regression.
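The four encodings above can be sketched in plain Python. All function names, the smoothing weight m, and the eps correction are illustrative assumptions. WoE sign conventions vary between references; this sketch uses ln(share of positives / share of negatives) and assumes both classes appear in the data.

```python
import math
from collections import defaultdict

def one_hot(values):
    """One column per category; no order is implied."""
    categories = sorted(set(values))
    return [[1 if v == c else 0 for c in categories] for v in values]

def ordinal(values, order):
    """Map each category to its rank in an explicitly given order."""
    rank = {c: i for i, c in enumerate(order)}
    return [rank[v] for v in values]

def target_encode_smoothed(values, targets, m=2.0):
    """Blend each category's target mean with the global mean.

    m is a hypothetical prior weight: rare categories are pulled
    toward the global mean, which reduces overfitting to their labels.
    """
    global_mean = sum(targets) / len(targets)
    sums, counts = defaultdict(float), defaultdict(int)
    for v, t in zip(values, targets):
        sums[v] += t
        counts[v] += 1
    return [(sums[v] + m * global_mean) / (counts[v] + m) for v in values]

def weight_of_evidence(values, targets, eps=0.5):
    """WoE per category for a binary target (1 = positive, 0 = negative)."""
    pos_total = sum(targets)
    neg_total = len(targets) - pos_total
    pos, neg = defaultdict(int), defaultdict(int)
    for v, t in zip(values, targets):
        if t == 1:
            pos[v] += 1
        else:
            neg[v] += 1
    table = {}
    for c in set(values):
        # eps avoids log(0) when a category has no positives or no negatives
        pos_share = (pos[c] + eps) / pos_total
        neg_share = (neg[c] + eps) / neg_total
        table[c] = math.log(pos_share / neg_share)
    return table

print(one_hot(["a", "b", "a"]))
print(ordinal(["high", "low"], ["low", "mid", "high"]))
print(target_encode_smoothed(["a", "a", "b"], [1, 0, 1]))
print(weight_of_evidence(["a", "a", "a", "b", "b", "b"], [1, 1, 0, 0, 0, 1]))
```

In practice the statistics for Target and WoE encoding would be fitted on training rows only and then applied to validation and test rows.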
Numerical Feature Scaling
- Z-score: Subtracts the mean and divides by the standard deviation; suitable for scenarios where the distribution is stable and the algorithm is sensitive to scale.
- Robust Scaling: Uses the median and interquartile range, providing robustness against outliers.
- MaxAbs Scaling: Preserves the zero-value structure of sparse matrices, suitable for text features.
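A small sketch of the three scalers, using only the standard library. The function names are illustrative, and the quartile method ("inclusive") is one of several common definitions, so other libraries may give slightly different IQR values.

```python
import statistics

def z_score(xs):
    """Center on the mean and divide by the population standard deviation."""
    mean = statistics.fmean(xs)
    std = statistics.pstdev(xs)
    return [(x - mean) / std for x in xs]

def robust_scale(xs):
    """Center on the median and divide by the interquartile range."""
    q1, median, q3 = statistics.quantiles(xs, n=4, method="inclusive")
    return [(x - median) / (q3 - q1) for x in xs]

def max_abs_scale(xs):
    """Divide by the largest absolute value; zeros stay exactly zero."""
    largest = max(abs(x) for x in xs)
    return [x / largest for x in xs]

data = [1, 2, 3, 4, 100]  # 100 is an outlier
print(z_score(data))       # the outlier inflates the mean and std
print(robust_scale(data))  # the four typical values keep a usable spread
print(max_abs_scale([0, 0, -2, 4]))
```

With the outlier present, Z-score squeezes the four typical values close together, while Robust Scaling spreads them evenly around zero.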
Machine Learning Model Evaluation
Selection of Evaluation Metrics
- Classification Problems: When classes are imbalanced, Accuracy is misleading; use F1-score or MCC instead. If the cost of false positives is high (e.g., a spam filter that blocks legitimate mail), prioritize Precision; if the cost of false negatives is high (e.g., cancer screening that misses a patient), prioritize Recall.
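- Regression Problems: If data contains outliers, MAE reflects typical error better than RMSE, because RMSE squares each error and lets a few large errors dominate. If large errors are especially costly, RMSE is the better choice.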
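The following sketch shows why Accuracy misleads on imbalanced data and how one large error affects RMSE more than MAE. The data is made up for illustration: 5 positives in 100 samples, and a degenerate model that always predicts the negative class.

```python
import math

def classification_metrics(y_true, y_pred):
    """Compute common binary-classification metrics from 0/1 labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1, "mcc": mcc}

def mae(y_true, y_pred):
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

def rmse(y_true, y_pred):
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))

# Imbalanced labels and a model that never predicts the positive class.
y_true = [1] * 5 + [0] * 95
always_negative = [0] * 100
report = classification_metrics(y_true, always_negative)
print(report)  # accuracy is 0.95, yet f1 and mcc are both 0.0

# Four small errors and one large one.
actual = [10, 10, 10, 10, 10]
predicted = [11, 11, 11, 11, 20]
print(mae(actual, predicted), rmse(actual, predicted))
```

The single error of 10 pulls RMSE well above MAE, which is the sensitivity to outliers described above.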
Overfitting Prevention
- Early Stopping: Stop training when the loss on the validation set has not improved for a set number of epochs (the "patience"), and keep the weights from the best epoch.
- Regularization: L1 (Lasso) can produce sparse weights, suitable for feature selection; L2 (Ridge) shrinks weights, improving model stability.
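Both ideas can be sketched without a training framework. The early-stopping helper works on a precomputed list of validation losses, and the patience parameter is an assumption rather than something fixed by these notes. The two shrink functions are the closed-form solutions of a one-dimensional penalized least-squares problem, used here only to show that L1 produces exact zeros while L2 merely shrinks.

```python
def early_stopping_epoch(val_losses, patience=2):
    """Return (stop_epoch, best_epoch) for a sequence of validation losses."""
    best_loss = float("inf")
    best_epoch = -1
    waited = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            best_loss, best_epoch, waited = loss, epoch, 0
        else:
            waited += 1
            if waited >= patience:
                return epoch, best_epoch
    return len(val_losses) - 1, best_epoch

def l1_shrink(weights, lam):
    """Soft-thresholding: minimizes 0.5*(w - w0)**2 + lam*|w| per weight."""
    shrunk = []
    for w in weights:
        if abs(w) <= lam:
            shrunk.append(0.0)  # small weights become exactly zero
        elif w > 0:
            shrunk.append(w - lam)
        else:
            shrunk.append(w + lam)
    return shrunk

def l2_shrink(weights, lam):
    """Minimizes 0.5*(w - w0)**2 + 0.5*lam*w**2 per weight."""
    return [w / (1.0 + lam) for w in weights]

losses = [0.9, 0.7, 0.6, 0.61, 0.63, 0.5]
print(early_stopping_epoch(losses))  # → (4, 2)

weights = [3.0, -0.2, 0.05]
print(l1_shrink(weights, 0.5))  # → [2.5, 0.0, 0.0]
print(l2_shrink(weights, 0.5))
```

Note that the loss of 0.5 at the last epoch is never reached: with a patience of 2, training halts at epoch 4 and the epoch-2 weights would be restored. A larger patience trades extra compute for a lower risk of stopping too early.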
Generative AI and LLM
RAG vs Fine-tuning Decision
- RAG: Addresses the model's knowledge cutoff and the need to cite internal data by retrieving relevant documents at query time; suitable for knowledge base Q&A.
- Fine-tuning: Adapts the model to specific tones, formats, or domain terminology; suitable for deep customization of behavioral patterns.
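The retrieval half of RAG can be sketched as follows. This is a toy: the documents are invented, similarity is bag-of-words cosine rather than learned embeddings, and the prompt wording is an assumption. The final call to a language model is left out on purpose.

```python
import math
import re
from collections import Counter

# Hypothetical knowledge base entries.
documents = [
    "Refunds are accepted within 30 days of purchase.",
    "Support is available on weekdays from 9am to 6pm.",
    "Shipping to outlying islands takes 5 business days.",
]

def vectorize(text):
    """Lowercased word counts; a stand-in for an embedding model."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(a, b):
    dot = sum(a[token] * b[token] for token in a if token in b)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

def retrieve(query, docs, k=1):
    """Return the k documents most similar to the query."""
    q = vectorize(query)
    ranked = sorted(docs, key=lambda d: cosine(q, vectorize(d)), reverse=True)
    return ranked[:k]

def build_prompt(query, docs, k=1):
    """Put retrieved text into the prompt so the model can ground its answer."""
    context = "\n".join(f"- {d}" for d in retrieve(query, docs, k))
    return (
        "Answer using only the context below.\n"
        f"Context:\n{context}\n"
        f"Question: {query}"
    )

print(build_prompt("Are refunds accepted after purchase?", documents))
```

Updating the knowledge base only requires changing the document list, whereas the same change through fine-tuning would require retraining.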
Prompt Engineering Techniques
- Chain-of-Thought (CoT): Guides the model to break down reasoning steps.
- Tree-of-Thoughts (ToT): Models reasoning as a tree search, suitable for complex planning tasks.
- Prompt Chaining: Breaks large tasks into multiple steps to improve the stability of complex tasks.
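Prompt Chaining can be sketched as a loop that feeds each step's output into the next prompt. The step templates and the stub_model function are hypothetical; stub_model stands in for a real model call so the control flow can be run without any API.

```python
def run_chain(steps, call_model, initial_input):
    """Run prompt templates in order, passing each output to the next step."""
    result = initial_input
    trace = []
    for template in steps:
        prompt = template.format(input=result)
        result = call_model(prompt)
        trace.append((prompt, result))
    return result, trace

# Hypothetical three-step chain; the second step adds a CoT-style instruction.
steps = [
    "Extract the key requirements from this request: {input}",
    "Think step by step and draft a plan for these requirements: {input}",
    "Review this plan and list its main risks: {input}",
]

def stub_model(prompt):
    """Stand-in for a real LLM call; returns a short tagged echo of the prompt."""
    return f"[model output for: {prompt[:30]}]"

final, trace = run_chain(steps, stub_model, "Build an internal FAQ chatbot")
for prompt, output in trace:
    print(prompt, "->", output)
```

Keeping the trace makes it possible to inspect or retry a single failing step instead of rerunning one large prompt.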
AI Governance and Security
AI Risk Management
- EU AI Act: Adopts risk-based classification; high-risk applications (e.g., healthcare, recruitment) must establish risk management systems and human oversight mechanisms.
- OWASP Top 10 for LLM: Focuses on preventing Prompt Injection, sensitive information leakage, and Excessive Agency.
- Privacy Protection: Differential Privacy, which protects personal data by injecting calibrated noise, is widely regarded as the gold standard for privacy guarantees; Federated Learning keeps raw data in the local environment and shares only model updates.
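Noise injection for Differential Privacy can be illustrated with the Laplace mechanism on a counting query. The dataset, the function names, and the epsilon value are assumptions for illustration; production systems should rely on an audited library rather than a sketch like this.

```python
import random

def laplace_noise(scale, rng):
    """Sample Laplace(0, scale) as the difference of two exponential draws."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)

def private_count(values, predicate, epsilon, rng):
    """Release a noisy count that satisfies epsilon-differential privacy.

    Adding or removing one person changes a count by at most 1
    (sensitivity 1), so the noise scale is 1 / epsilon.
    """
    true_count = sum(1 for v in values if predicate(v))
    return true_count + laplace_noise(1.0 / epsilon, rng)

rng = random.Random(42)  # fixed seed only to make the demo repeatable
ages = [23, 35, 41, 52, 67, 29, 48, 33]  # hypothetical records
noisy = private_count(ages, lambda age: age >= 40, epsilon=1.0, rng=rng)
print(noisy)  # close to the true count of 4, but not exact
```

A smaller epsilon means a larger noise scale: stronger privacy at the cost of less accurate answers.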
Explainability (XAI)
- SHAP: Allocates feature contributions through cooperative game theory, suitable for explaining individual predictions.
- LIME: Performs local approximation near samples, suitable for local explanation of black-box models.
- Grad-CAM: Visualizes the image regions the model focuses on, suitable for computer vision tasks.
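The game-theoretic idea behind SHAP can be shown by computing exact Shapley values for a tiny model. This is not the SHAP library's API; it is a brute-force sketch that enumerates every feature ordering, which is only feasible for a handful of features. The model, the input, and the all-zero baseline are made up for illustration.

```python
from itertools import permutations

def shapley_values(model, x, baseline):
    """Average each feature's marginal contribution over all orderings."""
    n = len(x)
    contributions = [0.0] * n
    orderings = list(permutations(range(n)))
    for order in orderings:
        current = list(baseline)
        previous_output = model(current)
        for i in order:
            current[i] = x[i]  # switch feature i from baseline to its real value
            new_output = model(current)
            contributions[i] += new_output - previous_output
            previous_output = new_output
    return [c / len(orderings) for c in contributions]

def toy_model(features):
    """Hypothetical model: one additive term plus one interaction term."""
    a, b, c = features
    return 2 * a + b * c

x = [1, 2, 3]
baseline = [0, 0, 0]
values = shapley_values(toy_model, x, baseline)
print(values)  # → [2.0, 3.0, 3.0]
```

The contributions sum to the difference between the model output at x and at the baseline (8 here), and the interaction term b*c is split evenly between the two features that produce it.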
Change Log: 2026-05-20 Initial document created.