Ben Hines
Posted July 24 (edited)

I'm building an asset tree using Python in Seeq Data Lab which will contain thousands of signals, conditions, and such. The path is always 3 levels deep and looks something like Product Category->Product ID->Quality Recipe->Quality Signal (or Condition). The connector which defines the signals and conditions for my datasource encodes the prescribed hierarchy path as the "Description" property on each Signal/Condition.

My logic looks something like this:

1. Get the existing hierarchy tree using spy.assets.Tree(...)
2. Use spy.search(...) to get all of the signals and conditions that I'm interested in
3. Use information from the asset tree and search results to determine:
   - Which branches do not yet exist in the tree
   - Which signals+conditions do not yet exist in the tree
   - All of the branches to which the new signals will be added
4. Build the new branches in the tree
5. Add the new signals+conditions to the tree. For each of the branches where new signals will go:
   - Extract a new dataframe from the search results dataframe. This dataframe will have only the new signals that belong to the current branch.
   - Call spy.tree.insert(...), passing the branch as the parent_path and the dataframe of new signals for that branch as the children.

This lets me run my script to update the existing tree as efficiently as I've been able to figure out. The problem is that the script still takes forever, and most of the time is spent in the calls to spy.tree.insert. Is there a better and more efficient way to call this function?

Edited July 24 by Ben Hines
John Brezovec (Seeq Team)
Posted July 24

Just to double check: it sounds like you're trying to run this script on a schedule in order to keep the tree updated as new signals come in from the datasource, is that correct?
Ben Hines (Author)
Posted July 25

Hi John. Eventually, that's the goal. For now, I'm just stepping through in a notebook to make sure it's correct and performant.
John Brezovec (Seeq Team, Solution)
Posted July 25

Got it! When working with large trees with spy.assets.Tree, you want to call insert as few times as possible. The way to do that is to insert using DataFrames. The first workflow that I would try is:

1. Get the existing hierarchy tree using spy.assets.Tree(...)
2. Use spy.search(...) to get all of the signals and conditions that you're interested in
3. Add two columns to the spy.search results, 'Path' and 'Friendly Name', which say where in the tree each item should be placed and what it should be named in the tree.
4. Insert the entire DataFrame into your tree in a single call. Don't worry about whether items already exist in the tree (they'll just get overwritten).
5. When pushing the tree, specify a metadata_state_file. This file enables 'incremental pushing', meaning SPy will only push items that were not previously pushed. This should dramatically decrease how long it takes to repeatedly push large trees with small changes.

An example of this on example data (anyone should be able to run it on their Seeq instance):

    import pandas as pd
    from seeq import spy

    tree = spy.assets.Tree('Insert with DataFrame', workbook='Example of DataFrame Insert')

    # Find the temperature signals in the example data
    tags_to_insert = spy.search({'Name': 'Area ?_Temperature', 'Datasource Name': 'Example Data'})

    # 'Path' is the asset each signal goes under (e.g. 'Area A'), taken from the signal name
    tags_to_insert['Path'] = tags_to_insert['Name'].str.extract(r'(Area \w+)_\w+')
    # 'Friendly Name' is the name the signal gets inside the tree
    tags_to_insert['Friendly Name'] = 'Temperature'

    # One insert call for the whole DataFrame, then an incremental push
    tree.insert(tags_to_insert)
    tree.push(metadata_state_file='insert_with_dataframe.pkl')
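For the case in the original question, where the hierarchy is encoded in each item's "Description" property, step 3 might look roughly like the sketch below. It uses plain pandas on a stand-in DataFrame rather than real spy.search output. The sample rows, the '->' separator inside "Description", and the helper name add_tree_columns are all assumptions for illustration. The ' >> ' path separator is the one used elsewhere in this thread.

```python
import pandas as pd

# Stand-in for spy.search(...) results. The rows and the '->' separator
# inside 'Description' are hypothetical.
search_results = pd.DataFrame({
    'Name': ['SIG-001', 'SIG-002', 'COND-001'],
    'Description': [
        'Category A->Product 1->Recipe X',
        'Category A->Product 1->Recipe Y',
        'Category B->Product 7->Recipe X',
    ],
})


def add_tree_columns(df, separator='->'):
    """Return a copy of df with 'Path' and 'Friendly Name' columns added."""
    out = df.copy()
    # Split on the assumed separator, trim whitespace, and rejoin with ' >> '
    out['Path'] = out['Description'].apply(
        lambda text: ' >> '.join(part.strip() for part in text.split(separator))
    )
    # Here each item simply keeps its own name inside the tree
    out['Friendly Name'] = out['Name']
    return out


tags_to_insert = add_tree_columns(search_results)
print(tags_to_insert[['Path', 'Friendly Name']])
```

The resulting DataFrame would then go to a single tree.insert(tags_to_insert) call, as in the example above, instead of one call per branch.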
Ben Hines (Author)
Posted July 25

Thanks for the information. I have a follow-up question... Let's say that my tree is initially empty and I want the search results to be inserted, say, 3 levels deep. I tried this:

    tags_to_insert['Path'] = 'Level 1 >> Level 2 >> Level 3'
    tree.insert(tags_to_insert)
    tree.visualize()

The resulting tree looks like this:

    Insert with DataFrame
    My Tree
        Level 3
            Signal
            Signal
            ...

I was expecting something more like:

    My Tree
        Level 1
            Level 2
                Level 3
                    Signal
                    Signal
                    ...

Can you let me know why this is?
Ben Hines (Author)
Posted July 25

I thought the issue might be that the complete path needs to exist in the tree. So, I tried this quick experiment:

    # Ensure the path "Level 1 >> Level 2 >> Level 3" exists in the tree:
    tree.insert(parent=tree.name, children=['Level 1'])
    tree.insert(parent='Level 1', children=['Level 2'])
    tree.insert(parent='Level 1 >> Level 2', children=['Level 3'])

    # Set the path on the signals in the search results and add to the tree
    tags_to_insert['Path'] = 'Level 1 >> Level 2 >> Level 3'
    tree.insert(tags_to_insert)
    tree.visualize()

This now looks like:

    My Tree
    |-- Level 1
    |   |-- Level 2
    |       |-- Level 3
    |-- Level 3
        |-- Signal 1
        |-- Signal 2
        |-- Signal 3
        |-- ...

Notice that the signals end up under a top-level "Level 3" rather than the nested one.
John Brezovec (Seeq Team)
Posted July 25

The path doesn't need to already exist in the tree. What's happening here is that SPy is truncating the path to the highest common asset in the Path. Since all of the items are being inserted at Level 1 >> Level 2 >> Level 3, Level 3 is the highest common asset, so Level 1 and Level 2 are stripped out before inserting. In practice, with your actual tags this shouldn't be an issue, unless you intend to have levels at the beginning of your tree that each contain only a single asset as a child.
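To make the truncation concrete, here is a small pure-Python model of the behaviour as described in the reply above. It is only an illustration of that description, not SPy's actual implementation, and the function name strip_common_leading_levels is hypothetical. It finds the leading path segments shared by every row, keeps the last shared one, and drops the ones before it.

```python
def strip_common_leading_levels(paths, separator=' >> '):
    """Model of the described truncation (an assumption, not SPy's code)."""
    split_paths = [path.split(separator) for path in paths]

    # Count how many leading segments are identical across every path
    common = 0
    for segments in zip(*split_paths):
        if len(set(segments)) != 1:
            break
        common += 1

    # Keep the last shared segment and drop the shared segments before it
    drop = max(common - 1, 0)
    return [separator.join(parts[drop:]) for parts in split_paths]


# Every row shares the full path, so only the last level survives
identical = ['Level 1 >> Level 2 >> Level 3'] * 3
print(strip_common_leading_levels(identical))

# Paths that already differ at the first level are left untouched
varied = [
    'Category A >> Product 1 >> Recipe X',
    'Category B >> Product 7 >> Recipe X',
]
print(strip_common_leading_levels(varied))
```

Under this model, real data that spans several top-level categories has no shared leading segment, which matches the point that actual tags should not be affected.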
Ben Hines (Author)
Posted July 25

Thanks for the explanation. I've implemented this as suggested and it performs vastly better than my original implementation!