Nairobi, Kenya

254728269396

Data Engineering With Databricks Training



ONSITE OR VIRTUAL

Programme Overview
Training Description

Who Should Attend

This course is ideal for:

  1. Data Engineers and ETL Developers
  2. Data Scientists working with large datasets
  3. Cloud Architects and DevOps Engineers
  4. Business Intelligence (BI) Developers
  5. Database Administrators (DBAs)
  6. IT Managers and Technical Leaders
  7. Analytics Professionals
  8. Software Engineers with an interest in data infrastructure
  9. Anyone involved in designing, building, or managing data pipelines on the cloud.
Session Objectives
  • Understand the Databricks Lakehouse Platform and its core components.
  • Learn about the Delta Lake format and its advantages for data engineering.
  • Acquire skills in using Apache Spark for large-scale data processing in Databricks.
  • Learn techniques for building reliable and scalable ETL/ELT pipelines.
  • Explore strategies for managing data quality and data governance with Databricks.
  • Understand how Structured Streaming supports real-time data ingestion.
  • Gain insights into orchestrating and automating data workflows using Databricks Jobs.
  • Develop a practical understanding of CI/CD practices for Databricks projects.
  • Master the use of Databricks notebooks, Repos, and the collaborative workspace.
  • Acquire skills in optimizing Spark jobs and managing cluster resources.
  • Learn to apply best practices for security and access control in Databricks.
  • Learn techniques for migrating existing data pipelines to the Databricks platform.
  • Explore strategies for MLOps and preparing data for machine learning.
  • Understand how Databricks SQL supports business analytics.
  • Develop the ability to lead and implement production-ready data engineering solutions on Databricks.
About the Course

In the era of big data and cloud computing, Data Engineering with Databricks is a key skill for building scalable, reliable, and efficient data pipelines that power modern analytics and machine learning applications. The Databricks Lakehouse platform unifies data warehousing and data lakes to simplify complex data architectures. This knowledge helps data engineers streamline Extract, Transform, Load (ETL) processes, ensure data quality, and accelerate data delivery to business stakeholders. The course equips data engineers, data scientists, and cloud architects with the knowledge and practical strategies needed to use Databricks effectively, from building a robust Delta Lake foundation to orchestrating complex data workflows and implementing production-grade data solutions. Without this expertise, organizations risk inefficient data processing, data quality issues, and a fragmented data ecosystem that hinders innovation and timely decision-making.

Curriculum & Topics

15 Topics | 10 Days

  • Subtopic 1.1: Overview of the Lakehouse architecture and its benefits.
  • Subtopic 1.2: The Databricks workspace: notebooks, clusters, and the Web UI.
  • Subtopic 1.3: Core components: Delta Lake, Apache Spark, and Databricks Runtime.
  • Subtopic 1.4: The role of Databricks in the modern data stack.
  • Subtopic 1.5: Setting up a Databricks environment and workspace.

  • Subtopic 2.1: Introduction to Delta Lake: what it is and why it's important.
  • Subtopic 2.2: Key features: ACID transactions, time travel, and schema enforcement.
  • Subtopic 2.3: Creating, reading, and writing to Delta tables.
  • Subtopic 2.4: Using Delta Lake for data versioning and auditing.
  • Subtopic 2.5: Optimizing Delta tables: VACUUM and OPTIMIZE commands.

  • Subtopic 3.1: Spark architecture review: RDDs, DataFrames, and the Catalyst Optimizer.
  • Subtopic 3.2: Using Spark SQL for data manipulation and querying.
  • Subtopic 3.3: Core Spark transformations and actions.
  • Subtopic 3.4: Partitioning data for performance.
  • Subtopic 3.5: Using Databricks notebooks for interactive Spark development.

  • Subtopic 4.1: The medallion architecture: Bronze, Silver, and Gold tables.
  • Subtopic 4.2: Designing and implementing a simple ETL pipeline.
  • Subtopic 4.3: Reading data from various sources (S3, ADLS, JDBC).
  • Subtopic 4.4: Transforming data using Spark DataFrames.
  • Subtopic 4.5: Writing clean, transformed data to Delta tables.

  • Subtopic 5.1: Introduction to Structured Streaming for real-time data.
  • Subtopic 5.2: Reading data from streaming sources (Kafka, Kinesis).
  • Subtopic 5.3: Performing stateful and stateless transformations on streams.
  • Subtopic 5.4: Writing stream-processed data to Delta tables.
  • Subtopic 5.5: Monitoring and managing streaming jobs.

  • Subtopic 6.1: Using MERGE statements for upserts and Slowly Changing Dimensions (SCD).
  • Subtopic 6.2: Handling common data quality issues and data validation.
  • Subtopic 6.3: Implementing data lineage and data cataloging.
  • Subtopic 6.4: Error handling and fault tolerance in ETL pipelines.
  • Subtopic 6.5: Best practices for production-ready ETL code.

  • Subtopic 7.1: Data quality frameworks and their implementation.
  • Subtopic 7.2: Schema evolution and enforcement in Delta Lake.
  • Subtopic 7.3: Using Databricks Unity Catalog for unified governance.
  • Subtopic 7.4: Managing access control and permissions for data and notebooks.
  • Subtopic 7.5: Auditing and compliance features in the Databricks platform.

  • Subtopic 8.1: Introduction to Databricks Jobs for scheduling and orchestration.
  • Subtopic 8.2: Creating multi-task workflows with dependencies.
  • Subtopic 8.3: Monitoring job runs and setting up notifications.
  • Subtopic 8.4: Parameterizing notebooks and jobs for reusability.
  • Subtopic 8.5: Integrating with external orchestrators like Airflow and Azure Data Factory.

  • Subtopic 9.1: The role of data engineers in the machine learning lifecycle.
  • Subtopic 9.2: Feature engineering and data preprocessing for models.
  • Subtopic 9.3: Creating feature tables with Delta Lake.
  • Subtopic 9.4: Using Databricks AutoML for feature selection.
  • Subtopic 9.5: Serving data to ML models from the Lakehouse.

  • Subtopic 10.1: Optimizing Spark configurations for job performance.
  • Subtopic 10.2: Troubleshooting slow Spark jobs using the Spark UI.
  • Subtopic 10.3: Best practices for data layouts and file sizes.
  • Subtopic 10.4: Caching and broadcasting for performance gains.
  • Subtopic 10.5: Techniques for cost optimization and resource management.

  • Subtopic 11.1: Securing the Databricks workspace and clusters.
  • Subtopic 11.2: Identity management and single sign-on (SSO).
  • Subtopic 11.3: Networking and private link configurations.
  • Subtopic 11.4: Monitoring cluster health and performance.
  • Subtopic 11.5: CI/CD practices for Databricks notebooks and projects.

  • Subtopic 12.1: Introduction to Databricks SQL: what it is and who it's for.
  • Subtopic 12.2: Using Databricks SQL warehouses for high-performance queries.
  • Subtopic 12.3: Connecting BI tools (Tableau, Power BI) to Databricks.
  • Subtopic 12.4: Creating dashboards and visualizations with Databricks SQL.
  • Subtopic 12.5: Governance and security for SQL analytics.

  • Subtopic 13.1: Strategies for migrating on-premises or legacy data warehouses.
  • Subtopic 13.2: Migrating ETL jobs from other platforms (e.g., Hive, Informatica).
  • Subtopic 13.3: Tools and scripts for data migration.
  • Subtopic 13.4: Planning for a seamless transition and minimizing downtime.
  • Subtopic 13.5: Case studies of successful migration projects.

  • Subtopic 14.1: Using Databricks Repos for Git integration and version control.
  • Subtopic 14.2: Databricks Connect for developing locally.
  • Subtopic 14.3: The Databricks APIs for programmatic access.
  • Subtopic 14.4: Managing ML models and experiments with MLflow.
  • Subtopic 14.5: The Databricks Marketplace and solutions.

  • Subtopic 15.1: Participants work in teams on a real-world data engineering project.
  • Subtopic 15.2: Exercise: Design a Lakehouse architecture for the project.
  • Subtopic 15.3: Build an ETL pipeline from raw data to a Gold table.
  • Subtopic 15.4: Orchestrate the pipeline using Databricks Jobs.
  • Subtopic 15.5: Present the final project and discuss key design decisions.


$2,000

This Programme Includes

Certificate of completion

Training manual

Reference materials

10 o'clock tea

Lunch

4 o'clock tea

Course Highlights
  • 10 Days Intensive Training
  • 15 Core Learning Topics
  • 10 Days Professional Sessions
  • Expert-led Training Delivery

PB Training Institute of Research and Consultancy