# How Should We Regress the Diamondbacks Offense? A Brief Lesson in Predictive Stats

Today we will look at the predictive stats xBABIP and xHR/FB% to gauge how we should regress our offense’s performance to date

Now seems like a perfect time to take our first look at the Diamondbacks offense. We’re about 1.5 months into the season, coming off a long stretch where the offense has performed quite poorly, and losing A.J. Pollock for an extended period of time (as might be the case) will only make things worse. Outside of a few players, the bats have struggled. Should we expect this to continue? Should the offense improve going forward? Or could it even decline further?

To help answer these questions, baseball analysts have developed “predictive” stats. These are the stats that have the “x” in front of them. Specifically today, I am going to take a look at xBABIP and xHR/FB%. These are the two stats that most often explain large changes in expected performance over small samples and are usually among the quickest to regress back to average. By looking at xBABIP and xHR/FB% and comparing to the actual BABIP and HR/FB% of our players, we can get an idea of which way we might expect our offense to perform going forward.

But first, I want to give a brief lesson on linear regression and HOW these predictive stats work.

## How Predictive Stats Work

This is something I feel deserves an explanation for all of our readers. I think many analysts write with other analysts in mind, and that leaves the non-analytically-minded readers in the dark. Terms like linear regression and R^2 (read: “R-squared”) might make perfect sense to someone like me, who took statistics courses in college and uses these concepts regularly at work, but they probably mean very little to the vast majority of people who don’t fit either of those criteria. These two terms are important to the usage of predictive stats, so I am going to explain, at a high level, what they are.

Predictive stats work by making a prediction for a stat and comparing it to the actual, “real” stat (i.e. what the player’s stat currently is in real life). These formulas are built using multiple-variable regression and can be quite complicated to create. The end result, however, takes a simple form:

xSTAT = C1*X1 + C2*X2 + C3*X3 + ... + y-intercept

C1, C2, C3, etc. are all coefficients and X1, X2, X3, etc. are the actual variables (“input stats”) that make up the overall xSTAT. The coefficients are important because they set the “weight” for each variable; in other words, the higher the coefficient relative to the other coefficients in the equation, the more important that variable is to xSTAT. Generally, you do not need to work at this level of the formulas unless you want to do a deeper analysis.

Instead, the construction of xSTAT actually makes things easier for you. It takes all of these variables and weights and spits out ONE number. You then compare that number to the actual result. Two variables, that’s it. This is easily expressed in graphical form because you now have a convenient (X,Y) coordinate for each player in the population.
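To make the one-number idea concrete, here’s a minimal sketch in Python. The coefficients, intercept, and input stats are invented for illustration; they are NOT the real xBABIP or xHR/FB% weights.

```python
# A sketch of how an "expected stat" collapses several inputs into one number.
# All numbers below are made up for illustration only.

def x_stat(inputs, coefficients, intercept):
    """Linear model: xSTAT = C1*X1 + C2*X2 + ... + y-intercept."""
    return sum(c * x for c, x in zip(coefficients, inputs)) + intercept

# Hypothetical player: three input stats with hypothetical weights.
inputs = [0.40, 0.22, 0.15]          # e.g. hard-hit%, fly-ball%, pull%
coefficients = [0.50, 0.30, 0.10]    # higher weight = more important input
intercept = 0.02

print(round(x_stat(inputs, coefficients, intercept), 3))  # prints: 0.301
```

Many weighted inputs go in; one expected stat comes out, ready to be compared against the player’s actual number.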

So this is where we get to linear regression. Linear regression is simply a way of modeling the relationship between two variables - an X variable and a Y variable. Usually, we’ll make the X variable the “expected stat” and the Y variable the “actual stat,” though you can reverse these and get the same results. When these are plotted, you can do some complicated math (or rather, the computer does it), and it spits out the straight line that best “fits” the data. What this means is that it finds the single line with the smallest combined squared vertical distance from all of the points on the graph. This line is effectively the “linear regression” of your model.
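The line-fitting math the computer does can be sketched with the closed-form least-squares formulas. The five data points here are invented just to show the mechanics:

```python
# Ordinary least squares: find the slope and intercept that minimize the
# total squared vertical distance from each point to the line.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]  # invented "expected stat" values
ys = [1.2, 1.9, 3.2, 3.8, 5.1]  # invented "actual stat" values

n = len(xs)
mean_x = sum(xs) / n
mean_y = sum(ys) / n

# Closed-form least-squares solution for the best-fit line.
slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / \
        sum((x - mean_x) ** 2 for x in xs)
intercept = mean_y - slope * mean_x

print(round(slope, 2), round(intercept, 2))  # prints: 0.97 0.13
```

The resulting line (y = 0.97x + 0.13) is the “linear regression” the article describes: the single line that best fits the cloud of points.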

So, let’s look at some examples that I made. We’ll start with a perfect fit: where the xSTAT equals the actual stat 100% of the time.

This is a theoretically perfect fit - where your expected stat matches your actual stat 100% of the time. And in real life, this pretty much NEVER, EVER happens. Whether it’s sports, manufacturing, science, or any other field where statistical analysis happens, a 100% perfectly-aligned relationship between two variables just doesn’t exist, thanks to randomness and variation. For instance, you would think the relationship between “balls that go over the fence in fair territory” and “is a home run” would match 100% of the time, but Deven Marrero might have a different idea. In large enough samples, variation rears its ugly head, no matter how improbable the event might be.

But this leads us to R^2 (formally, the “coefficient of determination”). You’ve probably heard this term before. R^2 is essentially a mathematical measure of how well-“fitted” the data is. More formally, R^2 measures the amount of variation in the actual stat that is “predicted” by the model (i.e. the expected stat). In the perfect-fit example, the R^2 is 1.000, or 100%, meaning that 100% of the variation is predicted by the input variable. Variable in = variable out. That simple.
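Here’s a minimal sketch of the R^2 calculation itself, using invented expected/actual numbers:

```python
# R^2 = 1 - (unexplained variation) / (total variation).
# All numbers below are invented for illustration.
expected = [0.250, 0.280, 0.300, 0.320, 0.350]  # the model's predictions
actual   = [0.260, 0.270, 0.310, 0.315, 0.345]  # what really happened

mean_actual = sum(actual) / len(actual)

# Residual sum of squares: variation the model fails to explain.
ss_res = sum((a - e) ** 2 for a, e in zip(actual, expected))
# Total sum of squares: all of the variation in the actual stat.
ss_tot = sum((a - mean_actual) ** 2 for a in actual)

r_squared = 1 - ss_res / ss_tot
print(round(r_squared, 3))  # prints: 0.928
```

Here the expected stat tracks the actual stat closely, so R^2 comes out high (about 93%); the further the predictions drift from reality, the closer R^2 falls toward zero.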

Now, let’s look at an example that is entirely random:

This graph is entirely random. I created 200 data points in Excel, randomly picked between 1 and 20, and then did it again (to randomly create an “expected stat” and a completely unrelated “actual stat”). When this was plotted, the shape above was formed. Notice how there appears to be no trend to the points; they are pretty evenly distributed in a square. As such, the R^2 of this chart reads as .0023, which is less than 1%. And if you were to expand this beyond 200 data points, it would get even smaller. This would be a worthless model, but sometimes that’s what happens with the data you are analyzing.
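You can recreate this experiment in a few lines of Python instead of Excel. A fixed seed stands in for Excel’s random draws, so the exact R^2 will differ from the .0023 above, but it lands just as close to zero:

```python
# Two completely unrelated columns of random numbers between 1 and 20,
# then the R^2 of the best-fit line through them.
import random

random.seed(0)  # fixed seed so the sketch is reproducible
n = 200
xs = [random.uniform(1, 20) for _ in range(n)]
ys = [random.uniform(1, 20) for _ in range(n)]  # unrelated to xs

mean_x = sum(xs) / n
mean_y = sum(ys) / n
slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / \
        sum((x - mean_x) ** 2 for x in xs)
intercept = mean_y - slope * mean_x

ss_res = sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))
ss_tot = sum((y - mean_y) ** 2 for y in ys)
r_squared = 1 - ss_res / ss_tot
print(r_squared)  # close to zero: there is no relationship to find
```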

Now, let’s look at an example that’s got a decent relationship.

Notice the shape of the data relative to the solid black line that goes through the center. They sort of match! This indicates that your model was able to at least find some manner of relationship in your variables.

For this example, I made a column of random numbers from 1-20 for the x-axis. But for the y-axis, I took the value in the x-column and added a random number from -10 to 10 (variable out = variable in +- 10). It is this range (-10 to 10) that adds the “randomness” the model can’t predict. If I set that range to 0, I would get a perfect line like the first example above (variable out = variable in). If I made the range wider than -10 to 10, I would get a smaller R^2.

The R^2 of this example is about 51.67%, which means our model is explaining about half of the overall actual stat; the other half is still unexplained variation (this is the +- 10 part: our input has no control over it). This actually makes some sense, intuitively - the noise spans a range of 20 (-10 to 10), the same as the range of the input variable (1-20), so each contributes roughly half of the total variation. Generally speaking, you won’t be able to make observations like this so easily in most models, but it should hopefully give you a basic understanding of what R^2 is trying to accomplish.
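This experiment is also easy to recreate in Python. With a fixed seed the R^2 comes out roughly half, give or take sampling noise, though not exactly the 51.67% from my Excel run:

```python
# y = x plus random noise in [-10, 10], then R^2 of the fitted line.
import random

random.seed(42)  # fixed seed for reproducibility
n = 200
xs = [random.uniform(1, 20) for _ in range(n)]
ys = [x + random.uniform(-10, 10) for x in xs]  # input plus noise

mean_x = sum(xs) / n
mean_y = sum(ys) / n
slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / \
        sum((x - mean_x) ** 2 for x in xs)
intercept = mean_y - slope * mean_x

ss_res = sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))
ss_tot = sum((y - mean_y) ** 2 for y in ys)
r_squared = 1 - ss_res / ss_tot
print(r_squared)  # roughly 0.5: the model explains about half the variation
```

Shrink the noise range toward 0 and R^2 climbs toward 1; widen it and R^2 falls, exactly as described above.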

## In other words...

In the world of “predictive stats” we are trying to find all of the variables that we can to make a model that tries to “predict” the actual stat for any given player. The closer that we can get the relationship between the “expected stat” and the “actual stat” for a large population of players to be linear, the better our model.

The amount of R^2 that you seek in a model is going to vary from application to application. In baseball, any R^2 over 50% is generally considered to mark a strong model. That doesn’t mean that models with smaller R^2s, especially as they drop below 30%, aren’t meaningful in certain ways, but it does hurt their predictive ability considerably given baseball’s relatively small sample sizes.

So, with that in mind, let’s take a look at the xHR/FB and xBABIP for the Diamondbacks hitters. I have 13 hitters that I am going to look at. For the sake of this article, I am not going to dive into the further workings of the formulas, but I will provide links to them for more reading.

- xHR/FB%, presented by Mike Podhorzer: R^2 = 0.792
- xBABIP, presented by Mike Podhorzer: R^2 = 0.5377

As I mentioned earlier, I chose xHR/FB% and xBABIP because they are very volatile stats that can help explain power or batting-average bursts and shortages over a short amount of time. If Player X comes out of the gate with 10 homers in April and a 30% HR/FB% but only a 10% xHR/FB%, that would imply he got lucky with his homers in April and we should expect him to hit homers at a significantly lower rate for the rest of the season.
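The arithmetic behind that hypothetical Player X is just the gap between the actual and expected rates:

```python
# Hypothetical "Player X" from the text: actual vs. expected HR/FB%.
actual_hr_fb = 0.30    # 30% of his April fly balls left the yard
expected_hr_fb = 0.10  # but his underlying inputs only support 10%

gap = actual_hr_fb - expected_hr_fb
print(f"over-performing by {gap:.0%}")  # prints: over-performing by 20%
```

A big positive gap suggests luck and likely regression downward; a big negative gap (as we’ll see with several Diamondbacks below) suggests better days ahead.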

However, I want this point to be clear: these predictors are obviously not perfect or else their R^2 values would be 1.000. Furthermore, players are always changing and any predictive stat will never be 100% predictive going forward for that reason. Still, they serve as an excellent starting point and considering we’re at 54% and 79% (!!!) for these two stats, they do have considerable predictive value.

## xHR/FB%

### Diamondbacks xHR/FB%, May 2018

| Player | xHR/FB% | HR/FB% | HR/FB% - xHR/FB% |
| --- | --- | --- | --- |
| AJ Pollock | 25.3% | 25.0% | -0.3% |
| Alex Avila | 21.7% | 20.0% | -1.7% |
| Chris Owings | 8.0% | 6.7% | -1.3% |
| Daniel Descalso | 16.4% | 10.3% | -6.1% |
| David Peralta | 18.5% | 21.2% | 2.7% |
| Deven Marrero | 21.7% | 0.0% | -21.7% |
| Jarrod Dyson | 7.6% | 7.7% | 0.1% |
| Jeff Mathis | 6.0% | 0.0% | -6.0% |
| John Ryan Murphy | 23.1% | 15.0% | -8.1% |
| Ketel Marte | 2.2% | 3.3% | 1.1% |
| Nick Ahmed | 12.7% | 15.8% | 3.1% |
| Paul Goldschmidt | 29.7% | 11.8% | -17.9% |
| Steven Souza Jr. | 5.9% | 0.0% | -5.9% |

Woof. A lot of under-performance there, ESPECIALLY for Goldy. Holy cow. Here is the HR/FB%-xHR/FB% in graph form:

Most players seem to be pretty close. This helps show that AJ Pollock’s power breakout this year might have been real. But man, look at that drop for Goldy. He is MASSIVELY underperforming, though that’s a good sign that he should do better going forward. Two other interesting points are Deven Marrero and John Ryan Murphy.

I would like to point out that xHR/FB% uses HR park factors from StatCorner, which currently show Chase Field at 111, presumably from last year. I changed this to 100 until we have more humidor data. If I had kept it at 111, it would have made us look like we were underperforming even further. But this change only accounted for maybe 0.5% of the xHR/FB% values at most, not a huge amount.

## xBABIP

### Diamondbacks xBABIP May 2018

| Player | xBABIP | BABIP | BABIP - xBABIP |
| --- | --- | --- | --- |
| AJ Pollock | 0.349 | 0.320 | -0.029 |
| Alex Avila | 0.266 | 0.233 | -0.033 |
| Chris Owings | 0.307 | 0.380 | 0.073 |
| Daniel Descalso | 0.310 | 0.307 | -0.003 |
| David Peralta | 0.337 | 0.340 | 0.003 |
| Deven Marrero | 0.394 | 0.256 | -0.138 |
| Jarrod Dyson | 0.274 | 0.194 | -0.080 |
| Jeff Mathis | 0.220 | 0.320 | 0.100 |
| John Ryan Murphy | 0.305 | 0.281 | -0.024 |
| Ketel Marte | 0.319 | 0.258 | -0.061 |
| Nick Ahmed | 0.312 | 0.247 | -0.065 |
| Paul Goldschmidt | 0.340 | 0.300 | -0.040 |
| Steven Souza Jr. | 0.280 | 0.200 | -0.080 |

And wow, yet another round of underperformance. Of course, it looks better in graph form:

Other than Owings and Mathis (who has barely played), our team is underperforming once again. Several of our players are sitting nearly .050 or more points below their expected BABIP. This is definitely good news for Goldy, Ahmed, Marte, and Souza Jr.

Again though, look at Marrero. Small sample size, but it seems like he was hitting the ball a lot better, in both regards, than he’s received credit for so far. Interesting.

Anyway, I more-or-less expected this coming into the article, but it was nice to see that the effect is pretty dramatic across the board for the team. Our offense should perform better going forward than it has been. Goldy returning to form might help offset the loss of Pollock, and a collective regression to the mean should be a net positive going forward, even sans Pollock. Now we just have to rely on our pitching and hope we can stay in the race long enough for Pollock to come back. Then our team might get back to its dangerous ways.