Introduction to Big Data Concepts
- Big Data introduction
- OLTP vs OLAP
- SQL vs NoSQL
- Data Warehouses vs Data Lakes
- Batch vs Streaming processing
Apache Spark Programming Essentials – Python Basics
- Python fundamentals (syntax, variables, data types)
- Control flow (if, loops)
- Functions and modules
- Collections (list, tuple, set, dictionary)
- File handling basics
- Python vs Pandas vs Spark overview
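A minimal sketch of how the Python topics above fit together (a function, a loop, an if/else branch, and the four core collections). The function name `word_stats` and the sample words are hypothetical, chosen only for illustration.

```python
def word_stats(words):
    """Count words using a dict, a set, and a tuple result."""
    counts = {}                      # dictionary: word -> occurrences
    for w in words:                  # loop over a list
        if w in counts:              # control flow
            counts[w] += 1
        else:
            counts[w] = 1
    unique = set(words)              # set: distinct words only
    # (word, count) tuple for the most frequent word
    most_common = max(counts.items(), key=lambda kv: kv[1])
    return counts, unique, most_common


sample = ["spark", "python", "spark", "sql"]   # illustrative input
counts, unique, most_common = word_stats(sample)
print(counts, unique, most_common)
```

The same counting logic is what a Pandas `value_counts` or a Spark `groupBy(...).count()` performs at larger scale, which is one way to frame the Python vs Pandas vs Spark comparison.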
Spark SQL & DataFrame Analytics – SQL Basics
- Relational database concepts
- SQL data types
- SELECT, WHERE, ORDER BY
- GROUP BY, HAVING
- JOIN types (Inner, Left, Right, Full)
- Subqueries & CTEs
- Basic indexing concepts
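The SQL topics above can be tried out without a cluster. The sketch below uses Python's built-in `sqlite3` module as a stand-in (an assumption; it is not Spark SQL, though the same statements are standard enough to carry over). The `customers` and `orders` tables and their rows are hypothetical sample data.

```python
import sqlite3

# In-memory database with two small, made-up tables
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, amount REAL);
INSERT INTO customers VALUES (1, 'Asha'), (2, 'Ben'), (3, 'Chen');
INSERT INTO orders VALUES (1, 1, 50.0), (2, 1, 70.0), (3, 2, 20.0);
""")

# CTE + GROUP BY + HAVING + INNER JOIN + ORDER BY
big_spenders = conn.execute("""
WITH totals AS (
    SELECT customer_id, SUM(amount) AS total
    FROM orders
    GROUP BY customer_id
    HAVING SUM(amount) > 30
)
SELECT c.name, t.total
FROM customers AS c
INNER JOIN totals AS t ON t.customer_id = c.id
ORDER BY t.total DESC
""").fetchall()

# LEFT JOIN keeps customers that have no orders at all
order_counts = conn.execute("""
SELECT c.name, COUNT(o.id)
FROM customers AS c
LEFT JOIN orders AS o ON o.customer_id = c.id
GROUP BY c.id, c.name
ORDER BY c.id
""").fetchall()

print(big_spenders)
print(order_counts)
```

The INNER JOIN result drops customers below the HAVING threshold, while the LEFT JOIN result keeps every customer and reports a count of zero for those without orders.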
Spark SQL & DataFrame Analytics (Advanced)
- DataFrame operations (select, filter, withColumn)
- Aggregations & groupBy
- Joins in Spark
- Window functions
- UDFs & performance considerations
- Temporary & global temporary views
- Data exploration using Spark SQL
Azure Data Lake & Cloud Storage Foundations
- Azure Storage overview
- Azure Data Lake Storage Gen1 vs Gen2
- Blob Storage vs ADLS
- Hierarchical namespace
- Access control (ACLs & RBAC)
- Storage account configuration
- Data organization (Bronze / Silver / Gold)
- Accessing ADLS using Databricks & ADF
Data Integration with Azure Data Factory
- Understand Azure Data Factory
- Describe data integration patterns
- Explain the Data Factory process
- Understand Azure Data Factory components
- Azure Data Factory security
- Set up Azure Data Factory
- Create Linked Services
- Create Datasets
- Create Data Factory activities and pipelines
- Manage Integration Runtimes
- Data integration with Azure Data Factory
- Code-free transformation at scale with Azure Data Factory
Data Transformation with Azure Data Factory
- Transform data using Azure Data Factory
- Execute code-free transformations at scale
- Create pipelines to import poorly formatted CSV files
- Create Mapping Data Flows
- Data cleansing and standardization
- Join, Aggregate, Derived Column and Conditional Split transformations
- Debugging & monitoring data flows
Azure Synapse & SQL Data Warehousing
- Azure Synapse workspace overview
- Synapse architecture
- Serverless SQL Pool vs Dedicated SQL Pool
- Data warehousing concepts
- Star & Snowflake schema design
- COPY INTO & PolyBase
- Distribution, partitioning & performance tuning
Introduction to Azure Databricks & Lakehouse
- Databricks workspace architecture
- Lakehouse architecture concepts
- Databricks vs ADF vs Synapse (use cases)
- Databricks components overview
- Cost and performance considerations
Databricks Workspace, Clusters & Notebooks
- Databricks workspace UI deep dive
- Cluster architecture
- Cluster types & autoscaling
- Notebook types (Python, SQL, Scala)
- Job scheduling
- Databricks REST API & CLI
- Git integration using Databricks Repos
Data Ingestion Techniques for the Lakehouse
- Ingesting CSV, JSON, XML, Parquet
- Mounting ADLS & Blob storage
- Auto Loader (cloudFiles)
- Schema inference & evolution
- Streaming ingestion basics
- Optimizing ingestion for high-volume data
Data Management, Governance & Unity Catalog
- DBFS vs External tables
- Metastore concepts
- Unity Catalog architecture
- Catalogs, schemas, tables
- Data access permissions
- Lineage & auditing
- Securing enterprise data access
Advanced Data Processing with Spark
- Complex transformations
- Handling nulls and corrupt records
- Schema evolution strategies
- Nested & semi-structured data
- Exploding arrays and structs
Databricks Utilities, Widgets & Automation
- dbutils (file system, secrets, jobs)
- Secret scopes & Key Vault integration
- Notebook widgets
- Parameterized notebooks
- Job orchestration
- Operational best practices
Delta Lake Architecture & Operations
- Delta Lake fundamentals
- ACID transactions
- Delta logs & versioning
- Schema enforcement & evolution
- Time travel & rollback
- OPTIMIZE & ZORDER
- VACUUM & retention management
LakeFlow & Modern Data Orchestration
- LakeFlow overview
- Delta Live Tables (DLT)
- Auto Loader integration
- Pipeline orchestration
- Monitoring & data quality expectations
- Event-driven architectures
Real-Time Streaming with Structured Streaming
- Streaming fundamentals
- Structured Streaming architecture
- Event Hubs & Kafka integration
- Stateful vs stateless processing
- Watermarking & late data handling
- Streaming Delta tables
- Fault tolerance & checkpointing
Power BI Integration
- Connecting Power BI to Databricks SQL Warehouse
- Import vs DirectQuery
- Performance optimization for BI
- Using Delta tables for analytics
- Dataset refresh strategies
- Publishing dashboards
Terraform for Databricks Automation
- Infrastructure as Code fundamentals
- Terraform basics
- Azure & Databricks providers
- Automating clusters, jobs, notebooks
- State management
- CI/CD best practices
Databricks Performance Optimization
- Understanding the Spark UI
- Identifying performance bottlenecks
- Avoiding data skew
- Shuffle optimization
- Caching & broadcast joins
- Z-Ordering & file compaction
- Cluster sizing & Databricks Runtime (DBR) selection
- Cost optimization best practices