Foundation LabEasy

Lab: Employee Analytics

Estimated time: 15 mins

Who This Lab Is For

Beginner Data Engineers looking to learn the basic flow of Apache Beam pipelines, reading text files, and computing simple metrics.

What You Will Learn

How to initialize an Apache Beam pipeline and read local datasets.
How to write a custom DoFn to parse CSV strings into structured dictionaries.
How to use beam.Filter and beam.CombineGlobally to calculate dataset averages.

1. Business Scenario

Learn basic transformations by filtering and aggregating employee records.

2. Input Dataset (\`dataset.csv\`)

Save the following raw rows locally as \`dataset.csv\` to test your pipeline:

text

id,name,department,salary,active
1,Alice,Engineering,120000,true
2,Bob,Marketing,90000,true
3,Charlie,Engineering,110000,false
4,David,Sales,85000,true
5,Eve,Engineering,130000,true

3. Starter Code Skeleton

Create a local file named \`starter.py\` and copy the following skeleton. Complete the missing transformations:

python

# starter.py - Employee Analytics
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

def run_pipeline():
    options = PipelineOptions()
    with beam.Pipeline(options=options) as p:
        # TODO: Read from 'dataset.csv'
        # TODO: Filter out inactive employees
        # TODO: Calculate average salary of active employees
        # TODO: Write to 'output.txt'
        pass

if __name__ == "__main__":
    run_pipeline()

4. Lab Requirements

Filter out inactive employees (where active is false).
Calculate the average salary of the remaining active employees.
Print the final average salary value to the output sink.

5. Step-by-Step Guide & Solution

Solution for Employee Analytics

Click below to reveal the complete, runnable Python SDK implementation solution and the step-by-step walkthrough to complete the lab.

AdSense Slot #847392Leaderboard Banner (728x90)