Dude... I Should Totally Start a Hedge Fund... Part 2

Well, I've gotten around to publishing a draft of my Stock Value Index dashboard on Tableau Public.  It has some nifty features that might not be obvious at first glance. 

First, you can filter the results shown in the bottom chart by clicking any number of rows in the top chart.  This is called a Dashboard Action: Filter.  The bottom chart displays the calculated values of the metrics, in addition to the Index Rank of those values; the latter are in parentheses.

Second, by hovering over a security name, you can use a Dashboard Action: URL to jump from the Dashboard to the Google Finance page for the security.  This works by substituting the security's stock ticker into a URL template wherever you see <stock ticker>: "finance.google.com/finance?q=<stock ticker>".

I did manage to resolve the granularity issue that was causing gaps in my data whenever I hit the daily API call limit mid-ticker.  Now I track every unique Stock Ticker + Item (Metric) individually, noting the timestamp of when that Stock Ticker and Metric were last pulled, and prioritizing refreshes accordingly.  Each night, I pick the oldest ticker-metric pairs and refresh them.  This took a significant amount of rework within Alteryx, only because the shape of the data I got back from Intrinio changed when I went from ten metrics per call to one metric per call.  If I had left even two metrics in one API call, the resulting JSON structure would have stayed the same.  It didn't, so I rebuilt everything around the new one-value-per-row structure.
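
For the curious, the refresh-priority logic boils down to something like this sketch.  It's written in JavaScript rather than Alteryx, purely for illustration; the field names (ticker, metric, lastPulled) are mine, and the 500-call budget is Intrinio's free-tier limit mentioned below.

// A minimal sketch of the nightly refresh prioritization, assuming a tracking
// table with one row per Stock Ticker + Metric and a lastPulled timestamp.
var DAILY_CALL_BUDGET = 500; // the free-tier limit; one call per ticker-metric pair

function pickTonightsRefreshes(trackingRows) {
  // Oldest data first: rows never pulled (lastPulled == null) float to the top.
  var prioritized = trackingRows.slice().sort(function (a, b) {
    var aTime = a.lastPulled ? new Date(a.lastPulled).getTime() : 0;
    var bTime = b.lastPulled ? new Date(b.lastPulled).getTime() : 0;
    return aTime - bTime;
  });
  // Take only as many ticker-metric pairs as the daily budget allows.
  return prioritized.slice(0, DAILY_CALL_BUDGET);
}

// Example usage with a couple of made-up rows:
var sample = [
  { ticker: "HPE", metric: "pricetoearnings", lastPulled: "2018-01-03" },
  { ticker: "TAP", metric: "pricetobook", lastPulled: null }
];
console.log(pickTonightsRefreshes(sample)); // TAP row first (never pulled), then HPE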

The bad news is, I definitely need to revisit how I rank negative cash flow.  For example, HPE comes up quite high on the value index at a Value Index Composite Rank of #16, due in large part to the very strong #17 it received for having a Price-to-Cash-Flow value of -159.79!  That's actually straight-up awful and really shouldn't be rewarded.  It also means there are 16 other companies with an even worse P/CF that are rewarded even higher.  If I had to guess, HPE's P/CF Rank needs to look a lot more like its P/E Rank, which is within spitting distance of the bottom decile.


Scripting Google Sheets Filters...

This one was way, way harder than I would have figured, but I've come to expect no less from Google's Apps team.

So, Filters are a great feature in a spreadsheet.  I love them.  You love them.   We all love them. 

For some reason, Alteryx's Google Sheets Output Tool honors filters.  That is, if you have a row filtered out, and you ask the Google Sheets Output Tool to "Overwrite Sheet (Drop)", which sounds an awful lot like "nuke that shit from orbit", well, it won't nuke row 2, because it's filtered.

Yeah, that's kinda bad.  One upside to this whole "honor thy mother and filter" concept is that it does keep the formatting on the sheet, which is... probably good?  I dunno.

So, I needed to script the "clearing" (but probably not the "removal") of every filter in the spreadsheet.  Seems simple; I've scripted all kinds of crazy formatting, so this looked like a trivial task.  Nope.  You can do it, though.  You just have to jump through some crazy hoops.  Here we go:

You have to "Enable Advanced Services."  What follows is a lot of annotation, which will quickly become obsolete, because Google Hates Me™.

To use an advanced Google service, follow these instructions:

  1. In the script editor, select Resources > Advanced Google services....
  2. In the dialog that appears, click the on/off switch next to the service you want to use. 
  3. At the bottom of the dialog, click the link for the Google API Console.
  4. In the console, click into the filter box and type part of the name of the API (for example, "Google Sheets API").
  5. On the next screen, click Enable API.
  6. Close the API Console and return to the script editor. Click OK in the dialog. The advanced service you enabled is now available in autocomplete.

[Screenshots of steps 2, 3, 4a, 4b, and 5]

Bad news, too: you have to do this for every "project" (spreadsheet) that you want to script this sort of filter-clearing behavior on, and then find a way to run the script before Alteryx outputs its data.  Not especially cool...  Hopefully Alteryx can come up with a better solution.

Anyway, once you've got all that done, you can run this script:

function clearFilter() {
  // Assumes the Sheets Advanced Service ("Sheets") is enabled per the steps above.
  var ss = SpreadsheetApp.getActiveSpreadsheet();
  var ssId = ss.getId();
  var sheetId = ss.getActiveSheet().getSheetId();
  // One clearBasicFilter request for the active sheet; this un-hides the filtered rows.
  var requests = [{
    "clearBasicFilter": {
      "sheetId": sheetId
    }
  }];
  // Send the request through the Sheets API (the Advanced Service), not SpreadsheetApp.
  Sheets.Spreadsheets.batchUpdate({'requests': requests}, ssId);
}
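
That clears the filter on whichever sheet happens to be active.  If you want to hit every filter in the spreadsheet, as described above, I believe you can build one clearBasicFilter request per sheet and send them all in a single batchUpdate.  Here's a sketch along those lines; I haven't stress-tested it, so treat it as a starting point:

// Sketch: clear the basic filter on every sheet in the spreadsheet.
function clearAllFilters() {
  var ss = SpreadsheetApp.getActiveSpreadsheet();
  // Build one clearBasicFilter request for each sheet.
  var requests = ss.getSheets().map(function (sheet) {
    return { "clearBasicFilter": { "sheetId": sheet.getSheetId() } };
  });
  Sheets.Spreadsheets.batchUpdate({ "requests": requests }, ss.getId());
}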


Dude, I Should Totally Start a Hedge Fund... Part 1

This post has been over a year and a half in the making.  Back around June 2016, I came across this post in r/investing on Reddit.  It discussed an intriguing investment strategy that was apparently originally developed by James O'Shaughnessy in his book, What Works on Wall Street.

This strategy attracted my attention for two reasons:

  • First, it is built upon a value-centered, buy-and-hold concept; the kind of strategy I would hope my granddad and Warren Buffett would approve of.
  • Second, it uses data to drive investing decisions.  There's no gut instinct here, no testosterone, just cold, hard facts.

I was about six months into Alteryx, and instinctively I knew "Alteryx can do this."  But while the instinct was there (and correct, I might add), I didn't yet have all the skills within Alteryx to make it happen, and while I was successful at the time in finding some of the metrics via free data sources like Google Finance, I couldn't get all of them.

In early 2017, I signed up for an ambitious cross-functional project with my Marketing Department.  Part of the appeal was their appetite for unconventional data sources.  I built multiple workflows for them based on third-party APIs and fell in love with the art of compiling an efficient set of API calls and then teasing apart whatever data came out the other side of my Download Tool.  I built up skills like parsing JSON files, which is far more involved than just dropping a JSON Parse tool onto an Alteryx canvas, and using temporary files and a Dynamic Input Tool to read them back into a workflow.

All the while, when I came upon some spare time, I'd pick the project back up, sifting through the internet for any programmatic data sources that might yield the metrics I was hoping for.  Nearly 17 months later, I came upon Intrinio.  This time, I was ready.

Intrinio is awesome.  They're a little company with a vast hoard of data, and these kind entrepreneurs recognize that there are folks like me out there: I have zero budget for a project like this, so I can't afford the insane subscription costs that most financial data firms like Thomson Reuters or Bloomberg might charge for the privilege of querying their data.  Still, Intrinio only offers 500 free API calls per day against their data-point API function, and each unique metric per ticker counts as one call.  I'm an S&P 500 guy, so I needed to make 5,000 API calls to get a complete data set, because I needed to pull 10 data points to calculate my six value metrics:

  1. Price to Book (pricetobook)
  2. Price to Revenue (pricetorevenue)
  3. Price to Earnings (pricetoearnings)
  4. Price to Cash Flow
  5. Enterprise Value to EBITDA
  6. Shareholder Yield

Allowing for some testing and distractions, it took me about three more weeks to get two workflows up and running.  The first workflow just gathers the data: it inputs the authentication credentials, the tickers that hadn't had data pulled yet, and the list of metrics to pull.  It appends credentials to tickers to metrics, compiles the complete list of remaining API calls, and sends them off to Intrinio.  The successful results get stored in a time-stamped output file after a minimum of processing - just enough to get the data parsed into clean columns and time-stamped.
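
Stripped of the Alteryx plumbing, the call-compilation step looks roughly like this JavaScript sketch.  The base URL and query parameter names are my recollection of Intrinio's legacy data_point endpoint, so treat them as placeholders and check the current docs; credentials are attached separately (in Alteryx, via the Download Tool's configuration).

// Builds the full list of remaining API calls: one per ticker-metric pair.
// The base URL and parameter names below are assumptions, not gospel.
function buildCallList(tickers, metrics) {
  var baseUrl = "https://api.intrinio.com/data_point";
  var calls = [];
  tickers.forEach(function (ticker) {
    metrics.forEach(function (metric) {
      calls.push(baseUrl +
        "?identifier=" + encodeURIComponent(ticker) +
        "&item=" + encodeURIComponent(metric));
    });
  });
  return calls; // 500 tickers x 10 metrics = 5,000 calls for a full refresh
}

console.log(buildCallList(["AAPL", "HPE"], ["pricetoearnings", "pricetobook"]));
// -> 4 URLs; scale the inputs up and you get the full remaining-call list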

There's something beautiful about a well-documented, carefully arranged Alteryx workflow.  #DataisBeautiful

A second workflow brought in each day's files and did additional reshaping of the data from long and narrow to short and wide; it found the latest data points for each ticker, collated them, cleaned up the data types and the "not meaningful" data points, and then "sort-ranked" the list of tickers by each metric, giving each one a score out of the total number of tickers.  At the end, the indexes for each metric were summed and voila - a Composite Value Index for the S&P 500.
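
The ranking math itself, stripped of the workflow, is roughly the following.  This is a JavaScript sketch under the same "lower value is better, rank 1 is best" convention described above; the sample data is made up, and it skips the "nm" / missing-data handling mentioned in the paragraph.

// Rank every ticker on each metric (lower value = better = rank 1),
// then sum the per-metric ranks into a Composite Value Index.
function compositeValueIndex(rows, metrics) {
  var scores = {};
  rows.forEach(function (row) { scores[row.ticker] = 0; });

  metrics.forEach(function (metric) {
    // Sort ascending on this metric, then hand out ranks 1..N.
    var sorted = rows.slice().sort(function (a, b) { return a[metric] - b[metric]; });
    sorted.forEach(function (row, i) { scores[row.ticker] += i + 1; });
  });

  // Lowest composite score wins.
  return Object.keys(scores)
    .map(function (t) { return { ticker: t, composite: scores[t] }; })
    .sort(function (a, b) { return a.composite - b.composite; });
}

// Made-up example with two metrics:
var valueRows = [
  { ticker: "AAA", pricetoearnings: 12, pricetobook: 1.1 },
  { ticker: "BBB", pricetoearnings: 30, pricetobook: 0.9 },
  { ticker: "CCC", pricetoearnings: 8,  pricetobook: 3.0 }
];
console.log(compositeValueIndex(valueRows, ["pricetoearnings", "pricetobook"]));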

 There's a bit more to do, but the proof of concept is complete!

I confess, I think there's still a little bit left to do:

For starters, I think I truncate a given ticker's metrics when I hit my 500 API calls for the day mid-ticker.  My list of API calls to make isn't at the Ticker-Metric granularity that it needs to be to prevent this; it's at the ticker level.  That'll be a fairly simple fix to make.

An improvement: replace P/E with E/P.  Currently, all but one metric is sort-ranked lowest-to-highest, and for any metric with no data or "nm" I have to stash those companies at the bottom of the sorted list in no particular order.  This model would have better data if I used Earnings-to-Price, sort-ranked highest-to-lowest, instead of Price-to-Earnings, since P/E divides by zero when Earnings are zero and turns into a meaningless negative number when Earnings are negative.  This would give me a more detailed data set for all the companies with negative earnings while still telling the same story, so I'll look into that change in the future.
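
To make the E/P idea concrete, here's the gist in JavaScript; the companies and numbers are invented, and this is a sketch of the concept, not the model itself.

// P/E blows up at zero earnings and flips sign below it, while E/P stays
// defined and simply gets more negative as earnings get worse.
function earningsYield(price, earningsPerShare) {
  return earningsPerShare / price; // E/P: higher is better
}

// Sort descending on E/P so negative-earnings companies fall naturally to the bottom.
var companies = [
  { ticker: "AAA", price: 100, eps: 8 },    // E/P = 0.08
  { ticker: "BBB", price: 50,  eps: -2 },   // E/P = -0.04 (negative earnings)
  { ticker: "CCC", price: 40,  eps: 0.5 }   // E/P = 0.0125
];
companies
  .map(function (c) { return { ticker: c.ticker, ep: earningsYield(c.price, c.eps) }; })
  .sort(function (a, b) { return b.ep - a.ep; })
  .forEach(function (c) { console.log(c.ticker, c.ep); });
// AAA 0.08, CCC 0.0125, BBB -0.04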

Lastly, and perhaps most vexing, the Price-to-Cash-Flow metric has a fair number of data points that are negative because the underlying cash flow is negative.  That's definitely worse than a positive cash flow: while a near-zero positive P/CF is very good, a negative P/CF is definitely questionable, though only possibly a sign of an unhealthy company.

The most polite of beverages and buy outs

A thirst-quenching example of that is TAP, whose super-high reinvestment metric is swallowing up its net profit after taxes, which I believe is due to their leveraged and aggressive consolidation strategy.  Molson was tiny relative to Coors when it announced the buyout in 2015, and since then they've continued to buy microbrewery labels.  I think what's going on in this particular situation is that their short-term reinvestment activity is larger than their cash flow for the same period, and that's not necessarily a bad thing.  That's a very normal scenario for a startup climbing a production s-curve, like Tesla.  Bottom line: if their eventual return on investment ends up making up for their current reinvestment, they're probably fine, and maybe they're doing awesome!

Thus, it seems to me that a negative Price-to-Cash-Flow in isolation is not sufficient to determine whether we're talking about a highly confident company piling money into a successful strategy to maximize its share of a growing industry, or a sinking ship whose quarterly cash flows are insufficient to cover its costs.  I've decided that, for now at least, a negative P/CF will rank as good as it gets in my model, but I'm likely to revisit the issue once I look into the individual stocks.  Given that we're in a bull market with low interest rates, and given that we're talking about the S&P 500, with nary a startup in sight, I think this isn't that big of a deal.  At any rate, within the model I've still got five other metrics to buoy or sink a given ticker, so we're not making any decisions on P/CF in isolation.

Adventures in Marketing: Connecting to SFDC using Alteryx

sfdc logo.png

Partly due to a multi-year company-wide effort to get the most out of our SFDC investment, I've been spending a good deal of time getting data from SFDC, through Alteryx, and ultimately into dashboards via Tableau.

My experiences with SFDC as a source of data have been quite varied, but overall trend positive.  On the upside, it is a versatile front-end instrument for all our sales efforts, and leveraging that data is a boon for any sales oriented tasks.  What I can acquire by querying SFDC is what salespeople, account managers, and executives create, alter, and act upon.  That's awesome!

Further, the design of SFDC's standard table structure is quite intuitive, with primary and foreign keys in ready supply; in most situations, I can reliably infer which keys are used to traverse tables, particularly when I'm able to stick within the standard tables.

Spirit Guides are really just Subject Matter Experts equipped with tribal knowledge!

A key hurdle to integrating SFDC data is our custom tables and the custom fields within the standard tables.  In my experience, there are a lot of cooks in our SFDC kitchen, and the strict governance requirements often found in conventional EDW databases are not present in our implementation.  As a result, a wild-west environment tends to flourish, and you may find yourself in need of a spirit guide.

For example, obsolete fields must be distinguished from their replacements.  In one particularly humbling instance, two attributes existed, literally Program_Status_Account and Status_Program_Account.  Only one was appropriate, and it was not intuitive which one was the correct instrumentation.  In other cases, you may find some data is shaped differently due to prior requirements or an ETL gone awry.

In the best cases, your observations will shed light on these issues: null records can be populated, munged data reloaded, obsolete attributes deprecated.

A query that pulls a count of registered accounts and a count of purchasing accounts in the current fiscal year.

From my perspective, this task is made trickier by a lack of transparency when viewing front-end reports.  There are API names that don't align to their front-end counterparts, for example.  I suspect there are folks who can pull reporting that aligns API fields to front-end fields, but I have yet to meet those privileged admins.  Fortunately for me, STX has a stable of really talented analysts who are experts on SFDC for me to collaborate with.  I help them synthesize and automate their data with other data sets, and they keep me informed on the SFDC goings-on.

Adventures in Marketing: Connecting to an API and the Download Tool

This is one of the cooler workflows I've built in Alteryx:

A finger on the pulse of the internet...  Can you spot the month with no Reddit data?

When working with the Download Tool, start by imagining you're building one or more URLs.  Each URL will likely identify who you are, what you want, and the time periods you want it for.

In this case, I want a count of mentions of Seagate that are positive, negative, neutral, for various forums and social platforms on the internet, for 6 complete historical months and the current month.

The Sysomos API has great documentation on how to write the various API calls that, when combined, will give me all of these details without hitting a row limit on their responses.

The green boxes on the left are built to quickly reconfigure the API requests.

For ease of maintenance, I choose to break out the various API inputs such that I only ever need to edit one cell to make a given change to the API call.  In total, I create 21 distinct API calls to get all the data I need for this workflow.  I make requests based on three sentiment types across seven calendar months: (positive, negative, neutral) x (-6 through 0 months of age).


These are all the "common" payload parameters for the Sysomos API

These are common inputs.  You should always expect to have a root URL.  You'd usually also expect to have at least one or two API Keys that uniquely identify you as the requester.  It's possible that publicly available APIs don't require a Key. Notice how I don't use all of these arguments.  I filter the rows out if they resolve to null.

Humans like many rows, machines prefer columns...

You might notice I enter each API parameter on its own row and then transform them using the Cross Tab tool.  I do this because rows of data are very friendly to read and manage in Alteryx, and that's not true once I have more than a couple of columns.  So I use Cross Tab to turn the variable names into column headers, leaving a single row of data (the values).

You'll find that once you have one row containing all the common data, you can append the varying data to get all of the unique combinations you need for a complete data set.  I do this twice: once for the 3 sentiments, and again for the 7 months.
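
The same append-the-variations idea, expressed in JavaScript for illustration: one set of common inputs crossed with the varying ones yields the 21 request configurations.  The URL and parameter names here are placeholders of my own, not Sysomos's actual API fields.

// Common inputs (one "row"), appended against sentiments and month offsets
// to yield 3 x 7 = 21 request configurations.
var common = {
  baseUrl: "https://api.example-sysomos-host.com/mentions", // placeholder URL
  apiKey: "<your key>",                                      // placeholder field name
  query: "Seagate"
};

var sentiments = ["positive", "negative", "neutral"];
var monthAges = [-6, -5, -4, -3, -2, -1, 0]; // 6 historical months plus the current month

var requests = [];
sentiments.forEach(function (sentiment) {
  monthAges.forEach(function (monthAge) {
    // Object.assign copies the common fields, then layers on the varying ones.
    requests.push(Object.assign({}, common, { sentiment: sentiment, monthAge: monthAge }));
  });
});

console.log(requests.length); // 21 distinct API calls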

Now, for a simple GET request, it's been my experience that configuring the download tool is pretty straightforward. I have had success leaving most settings defaulted, but I'd review your unique use case (and the API manual) to ensure you're complying with their requirements.

Here's how I'm set up.  First, I tell it where to find the URL, and I do tell it to encode the URL text.  That converts characters that aren't URL-safe into URL-compatible ones, including the colon I use to delimit the various forum types.
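
If you're curious what that encoding actually does, it's standard percent-encoding; in JavaScript terms it's the same idea as this (the delimiter-separated value below is made up):

// Percent-encoding turns characters that aren't URL-safe into %XX escapes.
// The colon used to delimit forum types becomes %3A, for example.
var forums = "BLOG:FORUM:TWITTER"; // made-up delimiter-separated value
console.log(encodeURIComponent(forums)); // "BLOG%3AFORUM%3ATWITTER"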

Configuring the payload based on the parameters I decided to use.

Now, in this case, my URL path is static; it never changes, so I just pre-compile it into my URL, and I don't need any custom HTTP headers either, so the Headers tab goes unused.  It may be that you need to vary your headers (things like authorization or content-type values), in which case that tab applies to you.  So, skipping to the Payload tab, I have the following attributes that I call, each of which is a unique column in my data.

You'll notice this is also where you would change the API type from GET (or FTP) to POST, etc.

You can also take the whole payload from a single field, or just type it into a big text-input box, I think, as an alternative to checking individual values field by field.  I could see that being useful for stranger APIs that require more customization.


So far, every API I've worked with has returned JSON, XML, or CSV.  Most only offer the first two, so check out the JSON Parse and XML Parse tools to wrap up your workflow.  Once out the other side, it is likely that you will need to further refine the data, particularly if you get JSON outputs.  Oftentimes the data is not structured in a tabular way because it's nested into groups of records.  If that's the case, the good news is it's still programmatically structured, so you can programmatically restructure it!
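
To make "programmatically restructure it" a little more concrete, here's the kind of flattening I'm talking about, sketched in JavaScript.  The shape of the nested response is invented for the example; the point is repeating the parent fields onto every child row.

// A typical nested API response: one outer record per ticker, with an inner
// array of data points. Tableau wants this flat: one row per data point.
var response = {
  data: [
    { ticker: "AAPL", points: [{ item: "pricetoearnings", value: 17.2 },
                               { item: "pricetobook", value: 7.9 }] },
    { ticker: "HPE",  points: [{ item: "pricetoearnings", value: 9.4 }] }
  ]
};

var flatRows = [];
response.data.forEach(function (record) {
  record.points.forEach(function (point) {
    // Repeat the parent fields onto every child row to get a tabular shape.
    flatRows.push({ ticker: record.ticker, item: point.item, value: point.value });
  });
});

console.log(flatRows); // three flat rows, ready for columns-and-rows tools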

For my part, if I can, I've tended to use a CSV output, writing the data to as many temporary files as I need (one for each API call), and then I'll use a Dynamic Input tool to pick those files up and continue my work.  The reason I prefer CSV is that the data already arrives in columnar format, obviating quite a few steps, particularly when compared to JSON, which requires a handful of transforms before heading to Tableau for visualization.

Adventures in Marketing: Google Analytics via Alteryx

Connecting to Google Analytics in Alteryx offers amazing Automation for Marketers.

Install the Google Analytics Tool

The bad news is that getting Google Analytics set up in Alteryx is a labyrinthine process, particularly getting the API Key set up.  But once there, you'll find yourself in a very happy and highly automated place.  To start, you'll want to download the correct version of the tool here.  That's the easy part, although even that is fraught with peril, as figure 1 shows.

you need a Google API Key!

Next, to get yourself a Google Analytics (and hopefully Sheets) API key, you'll need to travel through the seventh circle of hell.  TaraM has posted a very helpful guide here, though if you're a Seagate employee, reach out to me directly, as I have a customized version of the guide due to some internal restrictions affecting the process.  In preview, there are at least 17 oft-changing steps to check out a Google API Key, but once you're through it, you'll have three important credential components: client id, client secret, and refresh token.

Those credentials then allow you to authenticate more or less permanently, creating automated connections that use a static version of your account credentials to pull Google Analytics (or Google Sheets) data via Alteryx Designer or Alteryx Server.

Awesome, now what?!  Well, assuming your Google account has access to a provisioned Google Analytics account and its associated WebProperties, you'll see there are about 5 pages worth of configuration settings to work through in the Alteryx Google Analytics Tool.  Successfully configured, you'll be able to automate those results and schedule your reporting of Google Analytics data!

A quick note of warning: once operational, you will likely soon find a nasty-gram in your inbox from Google.  It'll say that you've exceeded your daily quota, that you've been sent a (Not Modified) response code, or that your usage hit various rate limits.  You might consider caching your results in a YXDB and developing the rest of your workflow against that data; be sure to pull only the data you need for your viz; and lastly, be aware that there are some tools for adjusting the timing of actions in a workflow, though these are all beyond the scope of this article.

Tip: Use Built-in Indexing of Time for Easy Trending

Back to the data at hand, I'll plug "nthWeek", "nthMonth", etc., because these dimensions let you sort time periods like months chronologically in Tableau, regardless of how those fields would sort alphabetically or numerically.  You can also use them to filter out particular months, like an in-progress month that isn't yet complete.

I personally like to cast the month periods as "short months" (e.g. 01 -> "Jan", etc.) using the switch() function in a Formula tool to translate the numeric month values into written shorthand.
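
The mapping itself is nothing fancy; the JavaScript equivalent of that Switch() expression is just a lookup table, something like this:

// Numeric month (zero-padded, as in the "01 -> Jan" example above) to short month name.
var SHORT_MONTHS = {
  "01": "Jan", "02": "Feb", "03": "Mar", "04": "Apr", "05": "May", "06": "Jun",
  "07": "Jul", "08": "Aug", "09": "Sep", "10": "Oct", "11": "Nov", "12": "Dec"
};

function shortMonth(monthNumber) {
  return SHORT_MONTHS[monthNumber] || monthNumber; // fall back to the raw value
}

console.log(shortMonth("01")); // "Jan"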


Then, in Tableau, I'll put [Mon] in the column shelf and sort it by the Average of NthMonth, which arranges the months into chronological order!

The end result is a super clean look that can scale down to a small size without compromising readability.

A quick transform to summarize and present data.

In Alteryx, there are some little tricks that I find myself reusing time and again.  One example is the use of Cross Tab to create an in-Alteryx pivot of the data, usually summarized at a high level with a familiar time segment as columns, to trend out the data and validate that it's accurate.

Since the goal here is to make the data human-readable, I almost always take the time to sort out the segmentation so it occurs predictably and according to established precedent.  For Seagate, we put the enterprise segments first, followed by the client segments and then the flash and consumer segments.  This tends to be true no matter which topic is being discussed, and it's easy enough to do with a quick use of the switch() function in a Formula tool against [Consensus Segment], followed by the Sort tool later on, after the Cross Tab tool has been invoked.  I tend to do a little cleanup after the fact, deselecting my index dimension using the Select tool.

Commenting Alteryx: Because Alzheimer's runs in the family...

About 62.5% of my ancestors rocked out to the surreal psychedelic trip that is dementia, which is probably mostly because they tend to live well into their eighties and nineties.  I'm certainly not complaining - I'm saying I forget things!

Like why I built what I built in Alteryx.

I would guess that a fair share of Alteryx users aren't trained programmers, and even those who are may not be big on the whole commenting discipline that comes from that field of practice.  I'd guess that's particularly likely if you're coming from the business side of things.  Folks like me were brought up on Excel, where there really isn't much accommodation made for annotation, and even named variables (an example of self-documenting code) are a relative rarity.

So, here's what I do, in the process of building out an Alteryx workflow:  I use the comment tool with every single tool to articulate what I'm doing and why I'm doing it.  Oftentimes when I'm building a new workflow, the annotations are my starting point, rather than an afterthought.  

Perhaps it's surprising, but when faced with a blank page, I typically find comments to be an easier and more efficient way to get started.  By organizing my thought process in the comments first, followed by tools that align to those steps, I can easily order and reorder my operations without stressing about the details of configuring tools.  All those tools have to be configured to work right without complaining, whereas I can often jot down the concepts I'm working through in prose more quickly.

Of course, this is just one way of approaching a blank page and I do tend to get some tools out on the page pretty quickly.

Bottom line: Comments save a lot of time over the long haul, and it may even be a faster way to think out a new workflow.  It's like showing your work on a math problem.  It's usually worth the extra effort.

Put a fork in it: Duplicating data in Alteryx to work around Tableau's limitations

In an exceptional situation, I've been asked to produce a fancy table in Tableau - like putting all the subtotals at the end, instead of interspersed, like normal.  Crazy, right?

It turns out, this is harder than one might think; probably even impossible if you don't pre-process the data ahead of time before bringing it into Tableau, 'cause apparently Tableau has some pretty stern opinions about how tables should look.

The use case here is to produce a table with columns of historical weeks of shipments, followed by future weeks of demand, followed by monthly subtotals, actual and demand quarterly totals, a historical quarterly total, plan totals, and performance to plan and prior quarter.  Essentially, it's the same numbers, just summarized across differently sized and overlapping segments of time.

In short, midway through that use case, Tableau is like "well, fuck all that" and flips the table and goes home to what I imagine is a pretty awesome mancave to brood about how I just insulted its whole raison d'être.

Enter Alteryx.

In this example, I need to arrange my "Actuals" a few ways:

  1. first the weekly Actuals columns (Column Numbers 1-13),
  2. then the weekly UCD columns (Column Numbers 21-33),
  3. then the Monthly Actual Totals (#41-43),
  4. then the QTD Actuals (#44),
  5. then Prior QTD Actuals (#45), QTD / Plan (#51), Prior Attainment (#52), and finally the UCD Qtr Total (#53).

For this I use Formula tools after forking the data, and I use the same Formula tools to simultaneously code the [Column Header] labels.  In Tableau, I'll put the [Column Number] pill on the Columns shelf first, followed by the [Column Header] pill, and then hide the headers for [Column Number] so they don't show up in the viz.

Notice how it doesn't really matter that there are gaps in the [Column Number]; All Tableau cares about is the fact that they're all numbers and thus they can be sorted numerically.

I filter the data where necessary, thereby duplicating only the data I want, by forking it, and then I union the repeated data back together so the subtotals can co-exist in the same table.
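
Conceptually, the fork-and-union is just building several differently summarized copies of the same rows, tagging each with a [Column Number] and [Column Header], and stacking them back up.  Here's a stripped-down JavaScript sketch of the idea; the field names and numbers are made up for illustration.

// One detail-level fact set, "forked" into differently summarized copies that
// are tagged with a Column Number / Column Header and unioned back together.
var actuals = [
  { week: 1, month: "Jan", units: 10 },
  { week: 2, month: "Jan", units: 12 },
  { week: 5, month: "Feb", units: 9 }
];

// Fork 1: weekly columns (Column Numbers 1-13, one per week).
var weekly = actuals.map(function (r) {
  return { columnNumber: r.week, columnHeader: "Wk " + r.week, units: r.units };
});

// Fork 2: monthly subtotal columns (Column Numbers starting at 41).
var monthTotals = {};
actuals.forEach(function (r) { monthTotals[r.month] = (monthTotals[r.month] || 0) + r.units; });
var monthly = Object.keys(monthTotals).map(function (m, i) {
  return { columnNumber: 41 + i, columnHeader: m + " Total", units: monthTotals[m] };
});

// Union the forks; gaps in the numbering don't matter, Tableau just sorts numerically.
var table = weekly.concat(monthly);
console.log(table);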

Now, there is one important risk to this kind of solution: when I duplicate data, I run the risk of having someone come along, not understand what's going on, and misinterpret the results.  In this case, one must invoke either the [Column Header] or [Column Number] dimension to de-dupe the data.  While this isn't unfamiliar territory for any analyst who pulls multiple snapshots into one table, it is worth mentioning.  To alleviate some of this risk, I have a disclaimer in a special warning calculation at the top of the list of dimensions within the Tableau workbook as a courtesy to my fellow analysts who might open the hood on this workbook.


Working within the constraints of Business Objects

For most folks at Seagate, the official source of truth is SAP Business Objects.  It offers a rich semantic business layer on top of our EDW and is well-governed by a talented and responsive team within our IT department.

Answering a typical business question usually entails the simultaneous consideration of multiple datasets originating out of Business Objects.  Take, for example, the daily evaluation of our in-quarter shipment performance compared to prior-quarter performance and two predictive data sets: our quarterly financial plan and the latest demand signal.  In my world, I'd need to pull three different data sets: Actuals, Plan, and Demand.  In any well-groomed dataverse, data sets such as these are maintained such that there's substantial dimensional commonality across all three.

At Seagate, this is possible from within Business Objects, and for particular use cases I do provide reports to users directly from an automated Business Objects report that is scheduled, exported, and emailed.

That said, Business Objects and Excel reports are rarely my customers' preferred presentation layer and seldom my preferred final data preparation layer:

To the former point, my users often demand features such as mobility, interactivity, or embeddedness that the Tableau platform offers and our Business Objects environment simply does not.  

To the latter point, my team often gets asked variations on the same question, and as a matter of operational efficiency, we'd all prefer to offer the same answer to the same question.  We can do that by reusing the same Business Objects queries and storing their results in ephemeral staging tables that we can all reference simultaneously for any particular business case.

Since we're now in the business of building staging tables, we can also use multiple queries within the same subject matter to work around the row-count limitations imposed by both the Business Objects platform and Excel files.  We'll refresh historical quarters using big, quarterly queries, prior weeks with weekly queries, and in-week results with an hourly schedule that matches the refresh rate of our EDW platform which pulls from our transactional platform hourly.

Now, that might seem like a big jump into ominous Shadow IT territory, but consider all the benefits:

  1. We've drastically reduced unnecessary refreshes of stale data, thus diminishing the resource demands coming from our team on an aging BI infrastructure.
  2. We've de-duped queries that hitherto were sometimes prone to entropy, especially when seen only from 20,000 ft.  This also reduced demand on our BI infrastructure.
  3. Our team has standardized on augmented data tables, with all eyes on the remaining queries, which are scrutinized from all kinds of angles, helping ensure our source of truth's integrity.
  4. Because the staged data is augmented with a touch of metadata and available at a detailed level, we're ready to quickly answer the next business question from within our preferred presentation layer, circumventing Business Objects query times that can take up to an hour to complete.

While I'm at it, I add a bit of metadata to facilitate the refresh process and add some quality of life improvements for quicker analysis and a cleaner presentation.