Create Pareto Charts with CDF in Seeq

Users are often interested in creating Pareto charts from conditions they've created in Seeq, sorted by a particular capsule property. The chart below was created using the Histogram tool in Seeq Workbench. For more information on how to create Histograms that look like this, check out this article on creating and using capsule properties.


Often, users would like to see the histogram above with the bars sorted from largest to smallest, as in a traditional Pareto chart. Pareto charts can be created from Seeq conditions using Seeq Data Lab. A preview of the chart that we can create is:


The full Jupyter Notebook documentation of this workflow (including output) can be found in the attached PDF file. If you're unable to download the PDF, the code snippets below can be run in Seeq Data Lab to produce the chart above.

#Import relevant libraries
from seeq import spy
import pandas as pd
import numpy as np
import matplotlib
import matplotlib.pyplot as plt

If running locally, log in to the SPY module using spy.login. Skip this step if running in Seeq Data Lab.


#Search for your condition that has capsule properties using spy.search
#Use the 'Scoped To' argument to search for items only in a particular workbook. If the item is global, no 'Scoped To' argument is necessary
condition = spy.search({
    "Name": "Production Loss Events (with Capsule Properties)",
    "Scoped To": "9E50F449-A6A1-4BCB-830A-8D0878C8C925",
})

#pull the data from the time frame of interest using spy.pull into a Pandas dataframe called 'my_data'
my_data = spy.pull(condition, start='2019-01-15 12:00AM', end='2019-07-15 12:00AM', header='Name',grid=None)
#remove columns from the my_data dataframe that will not be used in creation of the pareto/CDF
my_data = my_data.drop(['Condition','Capsule Is Uncertain','Source Unique Id'], axis=1, inplace=False)
#Calculate a new dataframe column named 'Duration' by subtracting the capsule start from the capsule end time
my_data['Duration'] = my_data['Capsule End']-my_data['Capsule Start']
#Group the dataframe by reason code
my_data_by_reason_code = my_data.groupby('Reason Code')
#check out what the new data frame grouped by reason code looks like


#sum total time broken down by reason code and sort from greatest to least
total_time_by_reason_code['Total_Time_by_Reason_Code'] = my_data_by_reason_code['Duration'].sum().sort_values(ascending=False)
total_time_by_reason_code['Total_Time_by_Reason_Code'] = total_time_by_reason_code['Total_Time_by_Reason_Code'].rename('Total_Time_by_Reason_Code')
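To see what the duration, grouping, and sorting steps above produce without connecting to Seeq, here is a minimal sketch on made-up capsules. The reason codes, timestamps, and the name total_duration are hypothetical stand-ins for the spy.pull result, not values from the original workbook.

```python
import pandas as pd

# Hypothetical capsules standing in for the dataframe returned by spy.pull
my_data = pd.DataFrame({
    "Capsule Start": pd.to_datetime(
        ["2019-01-15 00:00", "2019-01-16 00:00", "2019-01-17 00:00"]),
    "Capsule End": pd.to_datetime(
        ["2019-01-15 02:00", "2019-01-16 05:00", "2019-01-17 01:00"]),
    "Reason Code": ["Jam", "Changeover", "Jam"],
})

# Duration of each capsule (a Timedelta column)
my_data["Duration"] = my_data["Capsule End"] - my_data["Capsule Start"]

# Total duration per reason code, largest first
total_duration = (
    my_data.groupby("Reason Code")["Duration"]
    .sum()
    .sort_values(ascending=False)
)
print(total_duration)
```

Note that the summed values are still Timedelta objects at this point, not numbers of hours.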

#plot pareto of total time by reason code


#Calculate the total time from all reason codes
total_time = total_time_by_reason_code['Total_Time_by_Reason_Code'].sum()

#calculate percentage of total time from each individual reason code
percent_time_by_reason_code['Percent_Time_by_Reason_Code'] = total_time_by_reason_code['Total_Time_by_Reason_Code'].divide(total_time)

#Calculate cumulative sum of percentage of time for each reason code
cum_percent_time_by_reason_code['Cum_Percent_Time_by_Reason_Code'] = percent_time_by_reason_code['Percent_Time_by_Reason_Code'].cumsum()
cum_percent_time_by_reason_code['Cum_Percent_Time_by_Reason_Code'] = cum_percent_time_by_reason_code['Cum_Percent_Time_by_Reason_Code'].rename('Cum_Percent_Time_by_Reason_Code')
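The fraction and cumulative-sum steps above can be checked by hand on a small example. This is a sketch with hypothetical totals (in hours) that are already sorted from largest to smallest. The names hours, fraction, and cum_fraction are illustrative only.

```python
import pandas as pd

# Hypothetical total hours per reason code, already sorted largest first
hours = pd.Series({"Changeover": 50.0, "Jam": 30.0, "Cleaning": 20.0})

# Fraction of the overall total contributed by each reason code
fraction = hours.divide(hours.sum())

# Running total of those fractions; the last value should reach 1
cum_fraction = fraction.cumsum()
print(cum_fraction)
```

Because the input is sorted in descending order, the cumulative curve rises steeply at first and flattens out, which is the shape a Pareto chart is meant to show.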

#plot cumulative distribution function of time spent by reason code
cum_percent_time_by_reason_code['Cum_Percent_Time_by_Reason_Code'].plot(linestyle='-', linewidth=3,marker='o',markersize=15, color='b')


#convert time units on total time by reason code column from default (nanoseconds) to hours
total_time_by_reason_code['Total_Time_by_Reason_Code'] = total_time_by_reason_code['Total_Time_by_Reason_Code'].dt.total_seconds()/(60*60)

#build dataframe for final overlaid chart
df_for_chart = pd.concat([total_time_by_reason_code['Total_Time_by_Reason_Code'], cum_percent_time_by_reason_code['Cum_Percent_Time_by_Reason_Code']], axis=1)


#create figure with overlaid Pareto + CDF
ax = df_for_chart['Total_Time_by_Reason_Code'].plot(kind='bar',ylim=(0,800),style='ggplot',fontsize=12)
ax.set_ylabel('Total Hours by Reason Code',fontsize=14)
ax.set_title('Downtime Reason Code Pareto',fontsize=16)

ax2 = df_for_chart['Cum_Percent_Time_by_Reason_Code'].plot(secondary_y=['Cum_Percent_Time_by_Reason_Code'],linestyle='-', linewidth=3,marker='o',markersize=15, color='b')
ax2.set_ylabel('Cumulative Frequency',fontsize=14)                                                 
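For readers who prefer to build the overlay with matplotlib directly, here is a minimal sketch that uses twinx instead of the pandas secondary_y option shown above. It is an alternative under stated assumptions, not the original notebook's method, and the reason codes and hours are made up.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical total hours per reason code, already sorted largest first
hours = pd.Series({"Changeover": 5.0, "Jam": 3.0, "Cleaning": 2.0})
cum_fraction = (hours / hours.sum()).cumsum()

# Bars: total hours per reason code on the left axis
fig, ax = plt.subplots()
ax.bar(hours.index, hours.values)
ax.set_ylabel("Total Hours by Reason Code")
ax.set_title("Downtime Reason Code Pareto")

# Line: cumulative fraction on a second y-axis sharing the same x-axis
ax2 = ax.twinx()
ax2.plot(hours.index, cum_fraction.values, marker="o", color="b")
ax2.set_ylim(0, 1.05)
ax2.set_ylabel("Cumulative Frequency")
```

In a Jupyter notebook the figure renders when the cell finishes. In a plain script, call plt.show() afterwards.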



  • 2 years later...
  • Seeq Team

We've received reports from some users encountering errors when following the above guide. 

Below are solutions to two errors users have reported:

1) NameError

NameError: name 'total_time_by_reason_code' is not defined


To fix this, define the dataframes first by inserting the definitions towards the beginning of the script:
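A minimal sketch of those definitions, assuming each name is meant to start as an empty pandas DataFrame that the later column assignments fill in:

```python
import pandas as pd

# Assumption: each name starts as an empty DataFrame; the later lines of the
# script then add one column to each of them.
total_time_by_reason_code = pd.DataFrame()
percent_time_by_reason_code = pd.DataFrame()
cum_percent_time_by_reason_code = pd.DataFrame()
```

These lines need to run before the first line that assigns to total_time_by_reason_code['Total_Time_by_Reason_Code'].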


2) TypeError when trying to plot the first bar chart

TypeError: value should be a 'Timedelta', 'NaT', or array of those. Got 'int' instead.


To fix this, convert the 'Total_Time_by_Reason_Code' series to hours before the first bar chart is plotted by adding the following line of code:

total_time_by_reason_code['Total_Time_by_Reason_Code'] = total_time_by_reason_code['Total_Time_by_Reason_Code'].dt.total_seconds()/(60*60)
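To illustrate what that conversion does, here is a small sketch on hypothetical durations. Before the conversion the values are Timedelta objects, and afterwards they are plain floats in hours, which a bar chart with a numeric y-axis can handle.

```python
import pandas as pd

# Hypothetical summed durations per reason code (Timedelta values)
durations = pd.Series(
    pd.to_timedelta([5, 3], unit="h"),
    index=["Changeover", "Jam"],
)
print(durations.dtype)  # a timedelta64 dtype

# Same conversion as above: seconds divided by 3600 gives hours as floats
hours = durations.dt.total_seconds() / (60 * 60)
print(hours.dtype)
```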



Note that the original post performs this same conversion later in the script, just before building the dataframe for the final overlaid chart. That later conversion is no longer needed and should be removed.
