Foundation LabEasy

Lab: Employee Analytics

Estimated time: 15 mins

Who This Lab Is For

Beginner Data Engineers looking to learn the basic flow of Apache Beam pipelines, reading text files, and computing simple metrics.

What You Will Learn

  • How to initialize an Apache Beam pipeline and read local datasets.
  • How to write a custom DoFn to parse CSV strings into structured dictionaries.
  • How to use beam.Filter and beam.CombineGlobally to calculate dataset averages.

1. Business Scenario

Learn basic transformations by filtering and aggregating employee records.

2. Input Dataset (\`dataset.csv\`)

Save the following raw rows locally as \`dataset.csv\` to test your pipeline:

text
id,name,department,salary,active
1,Alice,Engineering,120000,true
2,Bob,Marketing,90000,true
3,Charlie,Engineering,110000,false
4,David,Sales,85000,true
5,Eve,Engineering,130000,true

3. Starter Code Skeleton

Create a local file named \`starter.py\` and copy the following skeleton. Complete the missing transformations:

python
# starter.py - Employee Analytics
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

def run_pipeline():
    options = PipelineOptions()
    with beam.Pipeline(options=options) as p:
        # TODO: Read from 'dataset.csv'
        # TODO: Filter out inactive employees
        # TODO: Calculate average salary of active employees
        # TODO: Write to 'output.txt'
        pass

if __name__ == "__main__":
    run_pipeline()

4. Lab Requirements

  • Filter out inactive employees (where active is false).
  • Calculate the average salary of the remaining active employees.
  • Print the final average salary value to the output sink.

5. Step-by-Step Guide & Solution

Solution for Employee Analytics

Click below to reveal the complete, runnable Python SDK implementation solution and the step-by-step walkthrough to complete the lab.

Advertisement
AdSense Slot #847392Leaderboard Banner (728x90)