Over the years, we’ve heard a lot of arguments over BABIP regression on here. The number itself tends to fluctuate from year to year, but there are players in MLB who can sustain better-than-average numbers consistently. One example is Diamondbacks 1B Paul Goldschmidt, whose lowest full-season BABIP is .340 from the 2012 season and who has a career BABIP of .356. The one flaw in BABIP is that it only uses outcomes, namely hits, home runs, strikeouts, and sacrifice flies. What it doesn’t tell you is how good batters are from a batted ball standpoint. The information we as fans now have tells us how much exit velocity a batter is generating, how often a hitter hits the ball hard, and other sorts of cool batted ball information.
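For reference, here is a minimal sketch of how BABIP is computed from those outcomes. It uses the standard definition, (H - HR) / (AB - K - HR + SF), which isn't spelled out above; the function name and the sample stat line are made up for illustration and don't belong to any real player.

```python
def babip(hits, home_runs, at_bats, strikeouts, sac_flies):
    """Batting average on balls in play, using the standard definition:
    (H - HR) / (AB - K - HR + SF)."""
    balls_in_play = at_bats - strikeouts - home_runs + sac_flies
    if balls_in_play <= 0:
        raise ValueError("no balls in play, BABIP is undefined")
    return (hits - home_runs) / balls_in_play


# Hypothetical stat line, purely for illustration.
example = babip(hits=150, home_runs=20, at_bats=550, strikeouts=120, sac_flies=5)
print(f"{example:.3f}")
```

Notice that nothing in the inputs says how hard or where the ball was hit, which is exactly the gap the batted ball data fills.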
Batted ball numbers tend to be volatile early in a player’s career or for part-time players. Over a long enough stretch, batted ball rates tend to stabilize, and BABIP stabilizes along with them. Batters who consistently put up higher BABIPs, like Paul Goldschmidt, tend to make a lot of hard contact and consistently drive the ball to all fields. Another high-BABIP type is the fast guy; some of them, like AJ Pollock, will outperform even their batted ball stats. Once the sample is large enough to remove reasonable doubt, we can treat a player’s BABIP as a skill and not luck. However, there hasn’t been a reliable metric that measures a player’s true BABIP based on batted ball talent.
The search for an appropriate method of calculating a batter’s true BABIP skill set started with Mike Podherzer of Fangraphs, who developed an equation for xBABIP in 2015. Multiple people have since modified it in search of a more reliable version. Podherzer’s original equation was:
xBABIP = 0.2530 + (O-Contact% * -0.0484) + (ISO * 0.1814) + (Absolute Value of Angle * -0.0024) + (LD% * 0.3657) + (FB% * IFFB% * -0.4531) + (Spd * 0.0046)
The initial correlation coefficient was .423, which is a solid number but one that leaves room for improvement. After finding that plate discipline metrics offered little correlation to BABIP, he dropped out-of-zone contact rate (O-Contact%). The biggest factors in terms of correlation with BABIP were line drive rate, fly ball and infield fly ball rate, launch angle, and speed. When the xBABIP model was tested against the ensuing season, the correlation between first-year xBABIP and second-year BABIP was 40.4%.
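To make the original equation easier to play with, here is a rough sketch of it as a Python function. The coefficients are copied from the equation above; everything else is my assumption: the function and parameter names, that the rate stats go in as fractions (.21 rather than 21), that the angle is in degrees, and the sample inputs, which are invented and not tied to any real player.

```python
def xbabip_original(o_contact, iso, angle, ld, fb, iffb, spd):
    """Original xBABIP equation as quoted in the article.

    Assumed units (not stated in the article): o_contact, ld, fb and iffb
    are fractions between 0 and 1, angle is in degrees, spd is the
    Fangraphs speed score.
    """
    return (
        0.2530
        + o_contact * -0.0484
        + iso * 0.1814
        + abs(angle) * -0.0024
        + ld * 0.3657
        + fb * iffb * -0.4531
        + spd * 0.0046
    )


# Invented inputs, purely for illustration.
sample = xbabip_original(
    o_contact=0.65, iso=0.150, angle=5.0, ld=0.21, fb=0.35, iffb=0.10, spd=4.5
)
print(f"{sample:.3f}")
```

Because the angle term takes an absolute value, a hitter at +5 and one at -5 get the same penalty; only the distance from zero matters.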
After Podherzer laid the foundation for a potentially successful BABIP prediction model, Alex Chamberlain took the formula a step further. With Fangraphs now having access to batted ball data, Chamberlain revised the equation to include opposite field contact rate (Oppo%), true fly ball rate, FB%*(1-IFFB%), true infield fly ball rate (IFFB/BIP), speed (Spd), hard hit rate (Hard%), and an ever-changing year constant that links xBABIP with BABIP. The final equation came out as:
xBABIP = .1770 - .3085*(True IFFB%) - .1285*(True FB%) + .3684*LD% + .0798*Oppo% + .0045*Spd + .2287*Hard% + Year Constant
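Here is the revised equation as a small Python sketch, again using the coefficients exactly as quoted. The function names, the fraction-style inputs (.21 rather than 21), and the sample numbers are my own assumptions. The year constants aren't listed here, so the default of 0.0 is only a placeholder and not a real value.

```python
def true_fb_rate(fb, iffb):
    """True fly ball rate as defined in the article: FB% * (1 - IFFB%)."""
    return fb * (1.0 - iffb)


def xbabip_revised(true_iffb, true_fb, ld, oppo, spd, hard, year_constant=0.0):
    """Revised xBABIP equation as quoted in the article.

    Assumed units: rates are fractions between 0 and 1, spd is the
    Fangraphs speed score. year_constant defaults to a 0.0 placeholder
    because the article does not list the per-year values.
    """
    return (
        0.1770
        - 0.3085 * true_iffb
        - 0.1285 * true_fb
        + 0.3684 * ld
        + 0.0798 * oppo
        + 0.0045 * spd
        + 0.2287 * hard
        + year_constant
    )


# Invented inputs, purely for illustration.
sample_true_fb = true_fb_rate(fb=0.35, iffb=0.10)
sample_revised = xbabip_revised(
    true_iffb=0.035, true_fb=sample_true_fb, ld=0.21, oppo=0.25, spd=4.5, hard=0.32
)
print(f"{sample_revised:.3f}")
```

Splitting out the true fly ball helper keeps pop-ups from being counted twice, once as fly balls and once as infield flies.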
The equation comes out with an R-squared of .423, which over 15 years of data covering 5771 players is fairly solid. With this equation in hand, we can compare xBABIP to BABIP for the 2016 Diamondbacks.
The big takeaway from the sheet comes from the right-most column, dBABIP, which is BABIP minus xBABIP. A positive number means a player over-performed his xBABIP and is a candidate for negative regression; a negative number means a player under-performed his xBABIP and is a candidate for positive regression. Four of the players, AJ Pollock, Peter O’Brien, Socrates Brito, and Oscar Hernandez, have too small a sample size to project for 2016, so their xBABIP isn’t very predictive, although Brito is close enough to a significant sample that you can somewhat project his xBABIP moving forward. For those wondering about Jean Segura, he had a 2016 BABIP of .355 and an xBABIP of .319 (dBABIP of .036), so those expecting BABIP regression now have semi-reliable numbers to back that up.
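The dBABIP column is simple enough to check by hand, but here is a quick sketch using the Segura numbers quoted above. The function names and the wording of the labels are mine, and the labels only restate the sign rule described in the paragraph.

```python
def dbabip(babip, xbabip):
    """Difference between actual and expected BABIP (BABIP minus xBABIP)."""
    return babip - xbabip


def regression_candidate(d):
    """Label the direction of likely regression from the sign of dBABIP."""
    if d > 0:
        return "negative regression candidate"
    if d < 0:
        return "positive regression candidate"
    return "in line with xBABIP"


# Jean Segura, 2016: BABIP .355, xBABIP .319 (figures from the article).
segura = dbabip(0.355, 0.319)
print(f"{segura:.3f}", regression_candidate(segura))
```

The sign only gives the direction; how much weight to put on the gap still depends on the sample size caveat above.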
The xBABIP metric isn’t perfect and still needs refinement, but it’s a nice comparison tool for a player’s BABIP. I don’t think the metric takes park factors into consideration, even though it is easier to hit at Chase Field than at AT&T Park. Likewise, neither BABIP nor xBABIP really accounts for defensive positioning (pull hitters get extreme over-shifts), although xBABIP tries to mitigate that a bit by giving opposite field contact rate a linear weight in the formula. As I explained earlier, a gap between the two suggests possible BABIP regression, since xBABIP is tied to a batter’s contact skills and BABIP to results. Both numbers tend to be fluky from year to year because they depend heavily on fluctuating quantities such as line drive rate and hard hit rate. However, for players well entrenched in the majors, we should see the two metrics closely aligned. I will compare career numbers later on.