Un-Silo-ing Your Insights: Translating Algorithms to a Web Application
Data experts and practitioners in pattern matching (think analysts, data scientists, engineers, statisticians, and consultants) tend to have an immense amount of siloed intellectual property.
It lives in shared drives, email inboxes, and work-issued machines everywhere. Spreadsheets with formulas competitors would love to see. Scripts that coworkers wish were more accessible within the organization.
The process of breaking these data silos into insight-driven applications takes many forms, but the end result always seeks three improved traits:
- Access, for a larger audience
- Automation, for a repeatable process
- Scale, to increase the potential impact
After picking the right piece of isolated insight to transform, achieving those three traits in a secure manner hinges on a new set of factors. We’ll focus on access and automation, as they tie in closely with the process behind new software.
The Realities of Access
Once a formerly confidential or protected algorithm is made available for others to use and manipulate, a new set of security hurdles emerges. Doubly so if the target audience is outside the originating organization; competitors are guaranteed to put it through its paces.
Conceptually, we’ll use a spreadsheet to illustrate the need. Access comes from pulling formulas into a custom (and completely inaccessible) server-side black box, then choosing which inputs and outputs users can interact with. It’s similar to setting a cell as Hidden, except editors won’t be toggling the protection on and off to make tweaks that are colocated with normal customer use.
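To make the black-box idea concrete, here's a minimal sketch in Python. Every name, formula, and constant below is hypothetical; the point is that the proprietary math lives in one private function while the handler exposes only a chosen set of inputs and outputs.

```python
# Minimal sketch of a server-side "black box": the proprietary formula
# stays in one private function, and the handler exposes only a chosen
# set of inputs and outputs. All names and constants are hypothetical.

def _proprietary_estimate(headcount: int, duration_weeks: int) -> dict:
    # The formulas an analyst once kept in a spreadsheet.
    base_cost = headcount * duration_weeks * 1850   # internal rate
    risk_buffer = base_cost * 0.12                  # internal constant
    return {"base_cost": base_cost,
            "risk_buffer": risk_buffer,
            "total": base_cost + risk_buffer}

ALLOWED_INPUTS = {"headcount", "duration_weeks"}
EXPOSED_OUTPUTS = {"total"}  # internal breakdown never leaves the server

def handle_request(payload: dict) -> dict:
    # Reject anything beyond the whitelisted inputs.
    if set(payload) - ALLOWED_INPUTS:
        raise ValueError("unexpected field")
    result = _proprietary_estimate(payload["headcount"],
                                   payload["duration_weeks"])
    # Return only the outputs users are meant to see.
    return {k: v for k, v in result.items() if k in EXPOSED_OUTPUTS}
```

Like the hidden spreadsheet cell, the intermediate values and constants exist only on the server; the user sees a single sanctioned output.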
The end result can cover any number of practical applications. Things like financial predictions, supply needs, and project estimates tend to lack a useful degree of interactivity for everyone who could use them.
Releasing previously siloed calculations as an application covers three immediate needs:
- Consumers are able to self-service, instead of tying up analysts with phone calls and emails to run things manually
- Internally, you’re able to work on the next insight instead of losing time to continually reapply this one
- The organization gains an opportunity to realize additional revenue through a new, self-service SaaS product
At the same time, providing access to your intellectual property raises a few key concerns.
With an interactive model, users will inevitably seek to reverse engineer the output. Ideally, you’re relying on something more complex than linear regression as an algorithm, and the output alone won’t betray the inner workings. Use of lookups, thresholds, and constants also goes a long way towards staving off copycats.
As a general rule of thumb, if it's simple to figure out, there's a good chance it doesn't need to be a custom application in the first place.
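As a hypothetical illustration of why lookups and thresholds help: they create flat bands where nearby inputs produce identical outputs, so probing the application reveals far less than it would against a smooth linear formula. All tiers and constants below are invented.

```python
# Invented example: a score built from a lookup table, thresholds, and
# an internal constant is much harder to reverse engineer from outputs
# alone than a straight linear regression would be.

RISK_TIER = {"retail": 1.0, "wholesale": 1.4, "export": 2.1}  # lookup

def risk_score(segment: str, volume: float) -> float:
    multiplier = RISK_TIER[segment]
    # Thresholds create flat regions: many different inputs map to the
    # same output, betraying little about the underlying math.
    if volume < 100:
        band = 0.5
    elif volume < 1000:
        band = 1.0
    else:
        band = 1.8
    return round(multiplier * band * 37.0, 2)  # 37.0: internal constant
```

A probing user who tries volumes 50 and 99 gets identical scores, which tells them almost nothing about the constants involved.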
There’s a pretty big benefit to manually running numbers. The analyst delivering the results can also provide context and actionable suggestions.
As calculations become self-service, are there values that users could misinterpret or take the wrong action on? Are there opportunities to add features to the application and assist users with data literacy?
We’ve mentioned a server-side black box, and with it comes the need for a network connection. Client-side (and offline) execution can’t adequately protect the underlying code from someone actively attempting to decipher it. So, if public consumption is the goal, network connectivity is a hard requirement.
Starting Automation by Stopping
Formal version control is an awkward obstacle for many analysis-based professions. The idea of automation runs counter to the usual process of trying a new iteration each time. To codify a model, the team has to reach consensus on the right way (for now) to do it.
Changes are still expected, supported, and necessary, but constant tweaking during development is the mortal enemy of budgets and timelines. More questions arise, too:
- Can users select from different versions of an algorithm as tweaks are made?
- If not, and old output is accessible to users, is it updated to reflect the new version?
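One way to answer both questions is to keep every released version of the algorithm callable and tag stored outputs with the version that produced them. The sketch below assumes this registry pattern; the model functions and constants are purely illustrative.

```python
# Sketch of a version registry: every released version stays callable,
# and each output records which version produced it. All model logic
# here is invented for illustration.

from typing import Callable

MODEL_VERSIONS: dict[str, Callable[[float], float]] = {}

def register(version: str):
    def wrap(fn):
        MODEL_VERSIONS[version] = fn
        return fn
    return wrap

@register("1.0")
def forecast_v1(demand: float) -> float:
    return demand * 1.25  # original growth assumption

@register("1.1")
def forecast_v1_1(demand: float) -> float:
    return demand * 1.25 + 12.0  # tweaked with a new constant

def run(demand: float, version: str = "1.1") -> dict:
    result = MODEL_VERSIONS[version](demand)
    # Storing the version alongside each output makes it possible to
    # decide later whether old results should be recomputed.
    return {"version": version, "result": result}
```

Whether users may pick a version or old outputs get recomputed becomes a product decision rather than a data-archaeology problem.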
Version control has a sibling in the complexity family: supporting unique alterations to the algorithm. Whether it’s per product, per customer, or some other subdivision of the model’s application, these tweaks aren’t bad by default, but they introduce a few more tricky questions (which seems to be the trend).
The biggest hurdle is figuring out how to handle updates to portions that aren’t unique. If a shared part of an algorithm changes, should every unique instance also update that portion of the calculation? Will that cause any issues with the nonstandard parts?
Looking towards future maintenance, custom bits compound to make change more difficult.
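A common pattern for taming that compounding cost is to keep the shared portion in one place and confine per-customer tweaks to small override hooks, so a change to the shared core reaches every customer automatically. A hypothetical sketch, with invented rates and customers:

```python
# One answer to the shared-vs-custom question: the shared portion lives
# in a single function, and per-customer deviations are small hooks,
# never full copies of the model. Everything here is hypothetical.

def shared_core(units: int) -> float:
    return units * 4.25  # shared rate; editing this updates everyone

CUSTOMER_OVERRIDES = {
    # Only the deviation is custom.
    "acme": lambda subtotal: subtotal * 0.9,  # negotiated discount
}

def price(customer: str, units: int) -> float:
    subtotal = shared_core(units)
    adjust = CUSTOMER_OVERRIDES.get(customer, lambda s: s)
    return adjust(subtotal)
```

When the shared rate changes, every customer picks it up in one edit, and the override hooks make the nonstandard parts easy to audit for conflicts.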
As formulas are productionized, performance takes a new priority. From a single user’s local machine, processing speed is an acceptable tradeoff with coding speed. If anything, running a script can just signal a coffee break.
With an expanded audience, slow execution is the enemy. Take recommendations from a video streaming service: If someone is forced to wait several seconds between each personalized item, they’re far more likely to skip straight to the alphabetized library.
Extra Security, Post-Silo
With the basic approach to security out of the way and our key concerns around automation addressed, we’re left with a server-side model that’s only accessible through chosen inputs and outputs. Many projects can check the security box at this point, but there are a few additional ways to further harden the result when needed.
Passing Server Data
Inputs and outputs generally pass back and forth to the server-side portion with field names that can give prying experts hints as to what’s going on behind the scenes. Fortunately, they’re easy to obfuscate (and Phoenix LiveView even does it automatically).
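For teams rolling their own, a hand-built version of that obfuscation might look like the following sketch. The field names and token scheme are assumptions for illustration, not any specific framework's API.

```python
# Hand-rolled field-name obfuscation: descriptive internal names never
# appear in the payload the browser sees. Field names are hypothetical.

import secrets

INTERNAL_FIELDS = ["profit_margin_pct", "churn_risk_score"]
# Generate opaque wire names once at startup.
FIELD_MAP = {name: "f_" + secrets.token_hex(4) for name in INTERNAL_FIELDS}
REVERSE_MAP = {wire: name for name, wire in FIELD_MAP.items()}

def to_wire(internal: dict) -> dict:
    # Outbound: replace telling names with opaque tokens.
    return {FIELD_MAP[k]: v for k, v in internal.items()}

def from_wire(payload: dict) -> dict:
    # Inbound: translate opaque tokens back to internal names.
    return {REVERSE_MAP[k]: v for k, v in payload.items()}
```

A name like `profit_margin_pct` tells a competitor plenty; `f_3a91c2d4` tells them nothing.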
We talked a bit before about performance being paramount, but speedy calculations can also help with brute force reverse engineering. Rather than slow things down, limiting a user’s submission rate is usually a better place to start.
The sky’s the limit in deciding how big (or small) the potential audience is. Options beyond everyone, everywhere include:
- Limiting usage to certain IP addresses, or an organization intranet
- Placing access behind a login and regulating account creation
- Flagging suspicious user behavior for review
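The first option can be handled with the standard library alone. A sketch, using a hypothetical intranet range:

```python
# IP allowlisting with the standard library. The 10.0.0.0/8 range below
# is a placeholder for an organization's actual intranet addresses.

import ipaddress

ALLOWED_NETWORKS = [ipaddress.ip_network("10.0.0.0/8")]

def ip_allowed(client_ip: str) -> bool:
    addr = ipaddress.ip_address(client_ip)
    return any(addr in net for net in ALLOWED_NETWORKS)
```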
Emerging from the Cocoon
As with any custom software project, the decision to move insights into a productionized application requires time and budgetary investment to get off the ground. The end result should cover the three primary traits (access, automation, and scale), but there are two core requirements:
- Value: Have access, automation, and scale been applied to a problem that stymied an analyst’s ability to find new insights? Or, have we surfaced sufficiently valuable intellectual property to a large audience?
- Security: Does our end result appropriately protect proprietary information from being reverse engineered, while still allowing for updates?
It’s not an easy process. Software is inherently tricky, and automation goes against instincts common in the experts seeking it.
The good news is that going through the process begets a better process. Automation allows teams to move beyond a manual present, and access provides more data than ever to further refine that automation.
With recent advancements in web capabilities, the decision between a native app and a web app becomes even more important.