Location
Jaipur
Experience
6-8 years
Industry
AI/Technology
Job Summary
Design, build, and optimize large-scale, production-grade data pipelines and analytics platforms on Azure, leveraging Databricks, Synapse, and the broader Microsoft data ecosystem. Deliver business-critical data assets for analytics, BI, and AI/ML initiatives.
Key Technical Responsibilities
- Architect modern data lakes using Azure Data Lake Storage Gen2 for batch and streaming workloads.
- Build and maintain scalable ETL/ELT pipelines using Azure Data Factory and Databricks (PySpark, Scala, SQL).
- Orchestrate data workflows across ADF, Databricks, and Synapse Pipelines; implement modular and reusable data pipeline components.
- Develop advanced notebooks and production jobs in Azure Databricks (PySpark, Spark SQL, Delta Lake).
- Optimize Spark jobs by tuning partitioning, caching, cluster configuration, and autoscaling for performance and cost.
- Implement Delta Lake for ACID-compliant data lakes and enable time travel and audit features.
- Engineer real-time data ingestion from Event Hubs, IoT Hub, and Kafka into Databricks and Synapse.
- Transform and enrich raw data, building robust data models and marts for analytics and AI use cases.
- Integrate structured, semi-structured, and unstructured data sources, including APIs, logs, and files.
- Implement data validation, schema enforcement, and quality checks using Databricks, PySpark, and tools like Great Expectations.
- Manage access controls using Azure AD, Databricks workspace permissions, RBAC, and Azure Key Vault integration.
- Enable end-to-end data lineage and cataloging via Microsoft Purview (or Unity Catalog in multi-cloud environments).
- Automate deployment of Databricks assets (notebooks, jobs, clusters) using Databricks CLI/REST API, ARM/Bicep, or Terraform.
- Build and manage CI/CD pipelines in Azure DevOps for data pipelines and infrastructure as code.
- Containerize and deploy custom code using Azure Kubernetes Service (AKS) or Databricks Jobs as required.
- Instrument monitoring and alerting with Azure Monitor, Log Analytics, and Databricks native tools.
- Diagnose and resolve performance bottlenecks in distributed Spark jobs and pipeline orchestrations.
- Collaborate with data scientists, BI engineers, and business stakeholders to design and deliver scalable data solutions.
- Document design decisions, create technical specifications, and enforce engineering standards across the team.
Required Skills & Experience
Hands-on with:
- Azure Data Lake Storage Gen2, Azure Data Factory, Azure Synapse Analytics, Azure Databricks
- PySpark, Spark SQL, advanced SQL, Delta Lake
- Data modeling (star/snowflake), partitioning, and data warehouse concepts
- Strong Python programming and experience with workflow orchestration tools (ADF, Airflow, or Synapse Pipelines)
- Infrastructure automation: ARM/Bicep, Terraform, Databricks CLI/API, Azure DevOps
- Deep understanding of Spark internals, cluster optimization, cost management, and distributed computing
- Data security, RBAC, encryption, and compliance (SOC2, ISO, GDPR/DPDPA)
- Excellent troubleshooting, performance tuning, and documentation skills