Jump to content

Batch Genealogy in Seeq Data Lab

Joe Reckamp

Recommended Posts

  • Seeq Team

In batch processes, genealogy of lots of batch IDs is important to understand. When a deviation occurs in a raw material or batch, the use of that material in subsequent steps of the process needs to be quickly identified. In the case of pharmaceuticals, batches that use that product may need to be pulled from the shelves immediately and not used by consumers.

Batch Genealogy starts with having data in a format where the input material lot numbers or Batch IDs are tracked for each batch. In Seeq, these input material lot numbers or Batch IDs are tracked as properties attached to each Batch capsule. The properties can either be set up from the datasource connector (such as bringing in OSIsoft PI Event Frames or information from a SQL database) or they can be built using transforms in Seeq Formula.

Once the data is set up properly, visualization can be achieved with Seeq Data Lab. This article will describe one approach to Batch Genealogy visualization in Seeq Data Lab.

In this particular scenario, we are looking at 6 distinct batch chemical transformations to get to the final product and we are tracking the key raw material as it goes through the process.

1. First, use spy.search to find the batches of interest:

results = spy.search({

    "Name": "Step * Batches"



# Print the output to the Jupyter page



2. Next, use spy.pull to retrieve the records for a specific time range of interest:

step2_batches = results.loc[results['Name'].isin(['Step 2 Batches'])]

step2_data = spy.pull(step2_batches, start='2019-01-01', end='2019-07-01', header='Name')



3. In the previous step, you can see that some batches have multiple input Batch IDs. In order to track all combinations of the input batch IDs and output batch IDs, use pandas DataFrame manipulations to get a new DataFrame with just the Input Batch ID and Output Batch ID combinations.

step2_1 = step2_data[['Step 1 Input Batch ID (1)', 'Batch ID']].rename(columns={'Step 1 Input Batch ID (1)': 'Step 1 Input Batch ID'})

step2_2 = step2_data[['Step 1 Input Batch ID (2)', 'Batch ID']].rename(columns={'Step 1 Input Batch ID (2)': 'Step 1 Input Batch ID'})

step2 = pd.concat([step2_1,step2_2]).rename(columns={'Batch ID': 'Step 2 Batch ID'}).set_index('Step 2 Batch ID')

step2 = step2.loc[(step2!='0.0').any(1)]



You may notice that instead of 6 rows in the previous DataFrame, we now have 11 rows (5 batches with 2 input batch IDs + 1 batch with 1 input batch ID = 11 total combinations).

4. Repeat the above steps for each additional step in the process, creating a final DataFrame with Step 1 through Step 6 batches of all combinations of input Batch IDs from prior steps.

step5_6 = pd.merge(step6.reset_index(),step5.reset_index().rename(columns={'Step 5 Batch ID': 'Step 5 Input Batch ID'}))

step4_5_6 = pd.merge(step5_6,step4.reset_index().rename(columns={'Step 4 Batch ID': 'Step 4 Input Batch ID'}))

step3_4_5_6 = pd.merge(step4_5_6,step3.reset_index().rename(columns={'Step 3 Batch ID': 'Step 3 Input Batch ID'}))

FinalDf = pd.merge(step3_4_5_6,step2.reset_index().rename(columns={'Step 2 Batch ID': 'Step 2 Input Batch ID'}))



  • Like 1
Link to comment
Share on other sites

  • Seeq Team

5. Use a Sankey visualization to view the relationships between batches. In order to use the Sankey visualization, first an additional column will need to be added to describe the relative width of each of the combinations. In this example, we are just going to add a column that equally weights each of the visualizations. However, if input weights or mass fractions were available for each combination, that could be used in this column instead. Finally, call the Sankey visualization code taken from an example website and modify functions slightly to match DataFrame set up and desired colors.


def genSankey(df,cat_cols=[],value_cols='',title='Sankey Diagram'):

    # maximum of 6 value cols -> 6 colors

    colorPalette = ['#FFD43B','#306998','#23B9F2','#089474','#646464','#F23E29']

    labelList = []

    colorNumList = []

    for catCol in cat_cols:

        labelListTemp =  list(set(df[catCol].values))


        labelList = labelList + labelListTemp


    # remove duplicates from labelList

    labelList = list(dict.fromkeys(labelList))


    # define colors based on number of levels

    colorList = []

    for idx, colorNum in enumerate(colorNumList):

        colorList = colorList + [colorPalette[idx]]*colorNum


    # transform df into a source-target pair

    for i in range(len(cat_cols)-1):

        if i==0:

            sourceTargetDf = df[[cat_cols[i],cat_cols[i+1],value_cols]]

            sourceTargetDf.columns = ['source','target','count']


            tempDf = df[[cat_cols[i],cat_cols[i+1],value_cols]]

            tempDf.columns = ['source','target','count']

            sourceTargetDf = pd.concat([sourceTargetDf,tempDf])

        sourceTargetDf = sourceTargetDf.groupby(['source','target']).agg({'count':'sum'}).reset_index()


    # add index for source-target pair

    sourceTargetDf['sourceID'] = sourceTargetDf['source'].apply(lambda x: labelList.index(x))

    sourceTargetDf['targetID'] = sourceTargetDf['target'].apply(lambda x: labelList.index(x))


    # creating the sankey diagram

    data = dict(


        node = dict(

          pad = 15,

          thickness = 20,

          line = dict(

            color = "black",

            width = 0.5


          label = labelList,

          color = colorList


        link = dict(

          source = sourceTargetDf['sourceID'],

          target = sourceTargetDf['targetID'],

          value = sourceTargetDf['count']




    layout =  dict(

        title = title,

        font = dict(

          size = 10




    fig = dict(data=[data], layout=layout)

    return fig

fig = genSankey(FinalDf,cat_cols=['Step 1 Input Batch ID','Step 2 Input Batch ID','Step 3 Input Batch ID','Step 4 Input Batch ID','Step 5 Input Batch ID','Step 6 Batch ID'],value_cols='Count',title='Batch Genealogy')

plotly.offline.plot(fig, validate=False)


In this visualization, each different colored column represents a step in the process (yellow = Step 1, dark blue = Step 2, etc.) with the batch ID of the step denoted in the text next to it. The gray connections represent the various combinations of where that Batch ID was used in the subsequent step. With the above code, the Sankey visualization will open in a new tab and using Plotly, the visualization will be interactive. Therefore, when you highlight a specific batch, it will highlight its dependencies so you can quickly see which batches are related to that particular batch highlighted.

  • Like 1
Link to comment
Share on other sites

Create an account or sign in to comment

You need to be a member in order to leave a comment

Create an account

Sign up for a new account in our community. It's easy!

Register a new account

Sign in

Already have an account? Sign in here.

Sign In Now
  • Create New...