Data Prep

Working within the constraints of Business Objects

For most folks at Seagate, the official source of truth is SAP Business Objects. It offers a rich semantic layer on top of our EDW and is well-governed by a talented, responsive team within our IT department.

Answering a typical business question usually entails the simultaneous consideration of multiple datasets originating out of Business Objects. Take, for example, the daily evaluation of our in-quarter shipment performance against prior-quarter performance and two predictive datasets: our quarterly financial plan and the latest demand signal. In my world, that means pulling three different datasets: Actuals, Plan, and Demand. In any well-groomed dataverse, datasets such as these are maintained so that there's substantial dimensional commonality across all three.
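To make that concrete, here's a minimal pandas sketch of stacking the three exports on their shared dimensions. The file names, column names, and the Units measure are hypothetical, and in practice this union happens inside our Alteryx workflows rather than in Python.

```python
import pandas as pd

# Hypothetical shared dimensions; the real grain depends on the subject matter.
SHARED_DIMS = ["Fiscal Quarter", "Fiscal Week", "Product Line", "Customer"]

frames = []
for label, path in [("Actuals", "actuals.xlsx"),
                    ("Plan", "plan.xlsx"),
                    ("Demand", "demand.xlsx")]:
    df = pd.read_excel(path)              # Business Objects export
    df = df[SHARED_DIMS + ["Units"]]      # keep only the common grain plus the measure
    df["Measure"] = label                 # tag which dataset each row came from
    frames.append(df)

# One long table with Actuals, Plan, and Demand comparable on shared dimensions.
combined = pd.concat(frames, ignore_index=True)
```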

At Seagate, that kind of dimensionally aligned pull is possible from within Business Objects, and for particular use cases I do deliver reports to users straight from it: an automated Business Objects report that is scheduled, exported, and emailed out.

That said, Business Objects and Excel reports are rarely my customers' preferred presentation layer and seldom my preferred final data preparation layer:

To the former point, my users often demand features such as mobility, interactivity, or embeddedness that the Tableau platform offers and our Business Objects environment simply does not.  

To the latter point, my team often gets asked variations on the same question, and as a matter of operational efficiency we'd all prefer to give the same answer to the same question. We can do that by running the same Business Objects queries once and storing the results in ephemeral staging tables that we can all reference simultaneously for any particular business case.

Since we're now in the business of building staging tables, we can also use multiple queries within the same subject matter to work around the row-count limits imposed by both the Business Objects platform and Excel files. We refresh historical quarters with big quarterly queries, prior weeks with weekly queries, and in-week results on an hourly schedule that matches the refresh rate of our EDW, which itself pulls from our transactional platform every hour.
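As a rough illustration of that cadence, a refresh driver might look something like the sketch below. The schedule times, partition names, and calendar-quarter boundaries are simplifying assumptions, not our actual fiscal calendar or scheduler.

```python
from datetime import datetime

def partitions_to_refresh(now: datetime) -> list[str]:
    """Decide which staging partitions are due at a given run time."""
    due = ["current_week"]                        # in-week data refreshes every hour
    if now.weekday() == 0 and now.hour == 1:      # e.g. Mondays at 01:00
        due.append("prior_weeks")
    if now.day == 1 and now.month in (1, 4, 7, 10) and now.hour == 2:
        due.append("historical_quarters")         # once per (calendar) quarter
    return due

print(partitions_to_refresh(datetime.now()))
```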

Now, that might seem like a big jump into ominous Shadow IT territory, but consider all the benefits:

  1. We've drastically reduced unnecessary refreshes of stale data, thus diminishing the resource demands coming from our team on an aging BI infrastructure.
  2. We've de-duped queries that hitherto were sometimes prone to entropy, especially when seen only from 20,000 ft.  This also reduces demand on our BI infrastructure.
  3. Our team has standardized on augmented data tables, and the remaining queries are watched by many sets of eyes from all kinds of angles, which helps protect the integrity of our source of truth.
  4. Because the staged data is augmented with a touch of metadata and available at a detailed level, we're ready to quickly answer the next business question from within our preferred presentation layer, circumventing Business Objects query times that can take up to an hour to complete.

While I'm at it, I add a bit of metadata to facilitate the refresh process, along with some quality-of-life improvements for quicker analysis and a cleaner presentation.

Keeping things fresh: Adding [Latest Snapshot Date]

One of the key additions I make to our team's staging tables is a Time attribute called [Latest Snapshot Date], which I append to each subject-matter dataset as I process it from the .xlsx file that originated out of Business Objects into the .yxdb file that serves as an input for our more involved data prep workflows.

In most situations, analysts are keen to cite their sources, all the way down to the snapshot timestamp used.  This helps immensely when questions of freshness come up.

Personally, I like to provide a small table that lists each dataset and its snapshot timestamp. This is trivial to produce in a presentation layer if each dataset is equipped with something like a [Latest Snapshot Date] attribute. I derive this attribute in two different ways:

1) Transactional / Live Data

With a transactional or pseudo-transactional dataset, where snapshots aren't provided, I capture the write time on the .xlsx file written by Business Objects and append it as the [Latest Snapshot Date].
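In Python terms, that looks roughly like the sketch below; the path and the parquet output are illustrative stand-ins for the Alteryx step that actually writes the .yxdb.

```python
import os
from datetime import datetime
import pandas as pd

path = "shipments_actuals.xlsx"                    # hypothetical Business Objects export
df = pd.read_excel(path)

# Use the file's write time as the snapshot stamp for every row.
write_time = datetime.fromtimestamp(os.path.getmtime(path))
df["Latest Snapshot Date"] = write_time

# Stand-in for the .yxdb output our Alteryx workflow writes.
df.to_parquet("shipments_actuals_staged.parquet")
```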

 

2) Snapshot Data

It's a bit different when a dataset comes as a series of snapshots. I'll use the Summarize tool to find the maximum value in the dataset's snapshot field and append that max as the [Latest Snapshot Date].
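A minimal pandas equivalent of that Summarize-and-append step, with hypothetical file and column names, looks like this:

```python
import pandas as pd

df = pd.read_excel("demand_snapshots.xlsx")   # hypothetical Business Objects export

# Summarize: Max(Snapshot Date), then append it back onto every row.
latest = df["Snapshot Date"].max()
df["Latest Snapshot Date"] = latest
```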

In some cases, the timestamp is provided as a string, and in that situation I'll use a RegEx tool to convert it from a string into the date/time format Alteryx prefers. Regular expressions are a topic worthy of their own series of articles. More on them in a future post!
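In the meantime, here's a tiny illustration of the idea in Python. The string format shown is an assumption, and in my workflows this conversion happens with the RegEx tool inside Alteryx.

```python
import re
from datetime import datetime

raw = "Snapshot: 2024-01-15 07:30"   # assumed shape of the incoming string

# Pull out the date and time pieces, then parse them into a real datetime.
match = re.search(r"(\d{4}-\d{2}-\d{2}) (\d{2}:\d{2})", raw)
if match:
    snapshot = datetime.strptime(" ".join(match.groups()), "%Y-%m-%d %H:%M")
    print(snapshot)   # 2024-01-15 07:30:00
```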