Earlier I wrote an article about Quantum Computing. If you read that, this article will be more interesting, but you can also just read this one and it will still mostly make sense.

As a bit of background, I have about 10 years of experience actively managing large data sets. These large data sets are around a quarter of a billion records with about 400 data points each. The total number of data points across all versions over all time is in the billions. To me, it’s just a really big spreadsheet.

You’re not going to believe this, but with data sets that large you end up getting a lot of “bad data”. It is actually easier to describe “bad data” if you explain “good data” first. #GoodData is any data that goes into the tables like it should and behaves as you would expect. #BadData is everything else: wrong character types, exceeded column limits, and a whole long list of other things.

So now let’s get into the term #FalsePositive. A #FalsePositive is nothing more than your algorithm or data giving you a “Positive” response that triggers an action (think of a security alarm going off when you walk through a metal detector), but it turns out to be a false reading (you didn’t have anything on you). So what is your tolerance for a false positive in your workflow? With the security example, as long as false positives stay below a certain rate, there isn’t much additional work at a slow door. At the TSA, on the other hand, too many false positives could radically slow down how long it takes to get through and also increase expenses.

So in keeping with the idea of using #MultipleVariables with more than one use case, let’s assume we wanted to write an equation that tells us what our #FalsePositiveTolerance is with metal detectors. We’ll use the answer to figure out the impact on both a small/slow entry and a high volume lobby. The reason for choosing the first is that a single guard manually checking people, without needing another guard, is roughly the lowest #OngoingExpense, and that is probably where the #FalsePositiveTolerance could be the highest (least impacted by false positives). The reason for the high volume use case is to see whether we can balance the cost of the metal detectors (assuming more expensive ones are more accurate) against the cost of security guards. False positives likely have the lowest impact in a low traffic area, but it might be surprising to see whether the cost of two guards and one machine beats one guard and one machine at the higher end of accuracy.

So we’ll need to start with a spreadsheet of a bunch of #variables. The first is “how long it takes a security guard to do the additional steps required for a false positive”; #FalsePositiveGuardTime will be the value. Another variable we’ll need is traffic, basically “how many people per hour need to be screened”; let’s call that #HourlyTraffic. We’ll also need a variable for peak times, so we’ll call that #PeakHourlyTraffic. We’ll also need the hourly cost of the guard, #GuardHourlyExpense (which includes everything), and the cost of a #MetalDetector. For purposes of this example let’s use two: #AcceptableMetalDetector and #HighPerformanceMetalDetector. That should be just about everything we need, so let’s start writing out the equation.
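To make those variables concrete, here is a minimal sketch of how they might be laid out in code instead of a spreadsheet. All of the names and numbers are illustrative placeholders I picked, not measured values.

```python
from dataclasses import dataclass

@dataclass
class CheckpointInputs:
    # Every value here is a made-up placeholder for illustration.
    false_positive_guard_time: float  # seconds a guard spends resolving one false positive
    hourly_traffic: int               # average people screened per hour
    peak_hourly_traffic: int          # people per hour at the busiest time of day
    guard_hourly_expense: float       # fully loaded cost of one guard, in $/hour
    detector_cost: float              # one-time purchase price of the detector, in $

# The two detector options, sharing the same traffic and guard assumptions.
acceptable = CheckpointInputs(60, 80, 100, 30.0, 500.0)
high_performance = CheckpointInputs(60, 80, 100, 30.0, 5000.0)
```

Laying them out this way makes the later “change one variable and watch the outcome” exercise a matter of editing one field.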

#FalsePositiveTolerance = a value of 0 to 1, where 0 indicates no tolerance for false positives and 1 indicates an effective (or actual) ability to accept unlimited false positives without any significant adverse reaction.

To set the two ends of the scale, we’ll start by using an #AcceptableMetalDetector at a price of $500 (I have no idea if that price is accurate) and 1 guard at a rate of $30/hour “all in”, our #GuardHourlyExpense. That gives us an #InitialCost of $500 and an ongoing expense of $30/hour during all operating times. On the other end, let’s use the #HighPerformanceMetalDetector at an expense of $5,000 and assume it never makes a mistake (this represents what we believe to be the most optimal experience possible, since we are trying to see what the values look like from both sides).
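As a quick sketch of how those two ends compare over time, here is the arithmetic in code, assuming (as I do later in this article) roughly 2,000 operating hours per year:

```python
def total_cost(detector_cost, guard_hourly_expense, annual_hours=2000):
    """One-time detector price plus the guard's ongoing expense over a year."""
    return detector_cost + guard_hourly_expense * annual_hours

# Using the example numbers from above:
print(total_cost(500, 30))    # AcceptableMetalDetector: 60500
print(total_cost(5000, 30))   # HighPerformanceMetalDetector: 65000
```

Notice how the $4,500 hardware gap is small next to the annual guard expense, which is why the throughput questions below matter more than the sticker price.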

So you won’t believe this, but we still need to add more variables. We need a variable for the number of customers scanned per hour with zero false positives, and a twin that shows the number of customers scanned per hour with false positives, based on the accuracy of the #AcceptableMetalDetector. The difference between those will likely show that zero false positives gives at least a slight edge in throughput, since every #FalsePositive with the less accurate machine reduces the #HourlyTraffic the combination can support. For example, if it takes a guard 60 seconds to manually check someone, the line is stopped for 60 seconds, which costs that checkpoint a full minute of throughput each time it happens. But is that worth the #Cost of $4,500 for the better #MetalDetector? Probably not, but this is just an example of a really complicated piece of math that is very insightful but takes a while to set up.
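That throughput penalty can be sketched like this. The 10-second base scan time and the 10% false positive rate are assumptions I picked for illustration, not vendor figures:

```python
def hourly_throughput(base_scan_seconds, false_positive_rate, guard_check_seconds):
    """People per hour when some fraction of scans trigger a manual guard check."""
    avg_seconds_per_person = base_scan_seconds + false_positive_rate * guard_check_seconds
    return 3600 / avg_seconds_per_person

# 10 seconds per walk-through, plus a 60-second manual check on false positives.
perfect = hourly_throughput(10, 0.0, 60)     # 360 people/hour with no false positives
lower_end = hourly_throughput(10, 0.10, 60)  # 225 people/hour at a 10% false positive rate
```

Even a modest false positive rate takes a visible bite out of capacity, which is exactly the twin-variable comparison described above.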

So let’s say that #PeakHourlyTraffic has a value of 100; this indicates that 100 people per hour go through this checkpoint at the peak of the day. If #AcceptableMetalDetector + #GuardHourlyExpense can support 100 people per hour, then it is likely not valuable enough to pay an additional $4,500, as no benefit would come from it. But let’s take that same metal detector and guard equation and change #PeakHourlyTraffic to 1,000. Assume as a foundation that our current combination can indeed service a #PeakHourlyTraffic of 100, but no more than that. If we simply bought more of the same ratio, one #AcceptableMetalDetector and one guard per 100 people, it would cost $5,000 for metal detectors and $300/hour to operate.
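Scaling that “one detector, one guard per 100 people” ratio up to a 1,000-person peak can be sketched as follows; the prices and the per-unit capacity are the example numbers from above:

```python
import math

def units_needed(peak_hourly_traffic, per_unit_capacity=100):
    """How many detector-plus-guard units cover the peak, rounding up."""
    return math.ceil(peak_hourly_traffic / per_unit_capacity)

units = units_needed(1000)        # 10 units
detector_spend = units * 500      # $5,000 in AcceptableMetalDetectors
hourly_guard_spend = units * 30   # $300/hour in guards
```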

Now let’s test two theories. The first is what happens if we go all in on deluxe metal detectors: how much would we save on guard salaries by putting our cash into equipment that helps the guards do their job more efficiently? Assuming these operate approximately 40 hours a week for roughly 50 weeks a year, that gives us a total of 2,000 hours per year. To see the #PotentialSavings, the lowest our #AnnualGuardExpense could be is a #TotalAnnualHours of 2,000 with one guard for all hours, which would be a total of 2,000 × #GuardHourlyExpense = $60,000. With 10 guards it would be $600,000, so the #TotalPotentialSavings = $600,000 − $60,000 = $540,000. Even with the #HighPerformanceMetalDetector costing $5,000, if it could reduce the workload of the staff by 50% (just to pick a number), the savings could be very significant.
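Re-running that savings arithmetic in code, with the same example numbers:

```python
# 40 hours/week for 50 weeks/year, per the assumption above.
annual_hours = 40 * 50                                          # 2,000 hours
one_guard_annual = annual_hours * 30                            # $60,000 at the $30/hour GuardHourlyExpense
ten_guards_annual = 10 * one_guard_annual                       # $600,000
total_potential_savings = ten_guards_annual - one_guard_annual  # $540,000
```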

So let’s go back to the #FalsePositive value. If we assume that false positives happen only with the lower end metal detectors, at a 10% higher rate than with the premium ones, the #FalsePositiveTolerance would look something like the answers to the following:

What is the total hourly throughput with one guard and one low end metal detector?

What is the total hourly throughput with one guard and one high end metal detector?

Over the course of one year, what is the #TotalMetalDetectorSecurity cost, combining #CostOfMetalDetector and #TotalGuardHourlyAnnualExpense, sufficient to meet the #PeakTotalHourlyThroughput for a given facility?
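Here is one hedged way to compute answers to those three questions in a script instead of a spreadsheet. The per-unit capacities and the doubled capacity for the high-end detector (the 50% workload reduction picked earlier) are illustrative assumptions, not real-world figures:

```python
import math

def annual_security_cost(peak_traffic, unit_capacity, detector_price,
                         guard_rate=30, annual_hours=2000):
    """Detectors purchased once, plus guards paid for every operating hour."""
    units = math.ceil(peak_traffic / unit_capacity)
    return units * detector_price + units * guard_rate * annual_hours

# Low end: one unit handles 100 people/hour (false positives slow it down).
low_end = annual_security_cost(1000, unit_capacity=100, detector_price=500)

# High end: assume the 50% workload reduction from earlier, so one unit
# handles 200 people/hour with no false positives.
high_end = annual_security_cost(1000, unit_capacity=200, detector_price=5000)

print(low_end, high_end)  # 605000 325000
```

Under these made-up numbers the high-end option wins at high traffic precisely because it cuts the number of guards needed, which is the trade-off the questions are probing.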

Going back to the very beginning, if you use #PeakTotalHourlyThroughput values of 100 and 1,000, that gives you data points far enough apart that you should be able to see some very pronounced savings (or something else pronounced); if you don’t, the difference is unlikely to be significant.

So why would we try to think of these as variables? Why bother with trying to make a quasi-quantum equation?

Perfectly valid question, and I thank you for asking it. The reason you drill down like this is that frequently the variables that can improve your outcomes might be easier to change than you think. When you are playing with different variables in a spreadsheet (or a supercomputer), it lets you predict what the future might look like. To make the example we just played with more tangible: if you were in charge of the security of a building and wanted to figure out the best way to give your visitors the best experience (think of a high end bank or something), your #FalsePositiveTolerance is probably quite low (assuming you have a high budget and never want to insult a customer with a false positive). If by contrast you are running a club where people are used to waiting in line, and a false positive doesn’t really make a big difference when the lines are already moving slowly, that’s probably at the other end of the #FalsePositiveTolerance (around 1). The closer the #FalsePositiveTolerance gets to 1, the less it is probably worth taking the time to figure out the actual answer.

To me, that is part of the beauty of trying to solve large complicated problems. In order to even begin to address large and wide scale problems, you need a sensible way to evaluate the current situation and a sensible way to attempt to predict what the future would look like if you changed something. Back to the security example: if I was trying to sell high end metal detectors and knew they could save 50% of the payroll for a security company’s customers, I would probably want to market directly to the facilities instead of to the security company. Understanding the answers to those questions could literally be the difference between a multibillion dollar product line and just being another offer in a sea of offers.

If you read this, thank you. I feel like part of why I have been able to learn so much is that I have largely been in communities that openly shared ideas and wisdom, so I like trying to contribute back. I have received far more help and information from the people around me than I could ever repay, but I’m happy to try just the same.