If you've been around the site long enough, you'll know there is a blizzard of abbreviations and stats which get thrown around by participants as they try to show one thing or another. It was suggested, and it seems a good idea, that we have a primer for the most common statistics used, so that those just dipping their toes in the water will have some idea what's being talked about. Hence, this series of articles, which will cover the main numbers you'll see. In the first part, we look at hitting stats; the second will cover pitching ones; and the third will tidy up anything left over, such as fielding stats, WAR, and so on.
Questions, comments, etc. are particularly welcome, so if anything is not clear, please ask and our crack team will respond. You will, however, not be tested on this at the end of semester...
We should start with a few statistical terms that may crop up. Average can actually mean one of three things: the mean, the median or the mode. The mean is the sum of the data, divided by the number of items. The median is the data point that's in the middle, when you sort the data. The mode is the most common value. For example, if you had the following data: 1, 2, 2, 2, 3, 3, 4, 5, 6, 7, 9
The mean would be (1+2+2+2+3+3+4+5+6+7+9) / 11 = 4. The median would be 3, because there are five items higher or equal to three, and five lower than or equal to it. The mode is 2, because there are more twos than any other number. In most cases, when we say "average," we mean the mean. Er, as it were. :-)
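If you'd rather let a computer do the adding up, here is a minimal sketch using Python's standard statistics module on the same eleven numbers. The variable names are just for illustration.

```python
from statistics import mean, median, mode

# The example data from above
data = [1, 2, 2, 2, 3, 3, 4, 5, 6, 7, 9]

average = mean(data)       # sum of the values divided by how many there are
middle = median(data)      # the middle value once the data is sorted
most_common = mode(data)   # the value that shows up most often

print(average, middle, most_common)
```

It should agree with the hand calculation: a mean of 4, a median of 3 and a mode of 2.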
Correlation measures the connection between two sets of data. It varies from -1 to +1, with zero meaning there's no connection detectable. Say, height and weight. They are kinda linked, because taller people will tend to weigh more, but it's not perfect, as there are tall, skinny people and short, fat ones. Correlation for that might be 0.7. A negative correlation means that as one set of data increases, the other decreases. Say, temperature and amount of clothing worn - the warmer it gets, the less people put on.
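For the curious, here is a rough sketch of how the usual (Pearson) correlation is worked out. The heights, weights, temperatures and clothing counts are all made up for illustration, so don't read anything into the exact values that come out.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation between two equal-length lists of numbers."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # How the two lists move together, and how much each moves on its own
    co_movement = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sum((x - mean_x) ** 2 for x in xs)
    spread_y = sum((y - mean_y) ** 2 for y in ys)
    return co_movement / sqrt(spread_x * spread_y)

# Made-up data, purely for illustration
heights_in = [66, 68, 70, 72, 74, 76]
weights_lb = [150, 175, 160, 200, 185, 210]
temps_f = [40, 50, 60, 70, 80, 90]
layers_worn = [4, 4, 3, 2, 2, 1]

height_weight = pearson(heights_in, weights_lb)   # positive, but short of +1
temp_clothing = pearson(temps_f, layers_worn)     # negative

print(round(height_weight, 2), round(temp_clothing, 2))
```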
Percentiles and quartiles. Knowing that .269 is the average batting average is useful, but how good is a .300 average? Well, you can plot batting average last year and find out that 90% of hitters will bat .299 or less. Put another way, .299 marks the ninetieth percentile for BA. If you bat above that, you're in the top ten percent of hitters (for batting average, at least). You may also see the 25th percentile referred to as the bottom quartile and the 75th percentile as the top quartile, as they mark the boundary for the bottom and top quarter of data respectively.
Sample size. The bigger the sample size, the more accurate results will be: over a short span, luck can induce wild variations. If you flip a coin ten times, the odds of getting seven or more heads are 17.2%; but if you flip it 100 times, the odds of 70 or more heads are virtually zero. Same with batting. A .250 batter has a 15% chance of hitting .300 in 100 at-bats, purely by luck; over 500 ABs, however, the odds are less than one in a hundred. Knowing what counts as a meaningful sample size will help you work out whether a number means anything.
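Those odds can be checked with a little binomial arithmetic. The sketch below assumes every flip or at-bat is independent with the same chance of success, which is a simplification for real hitters; the function name is made up for this example.

```python
from math import comb

def prob_at_least(k, n, p):
    """Chance of k or more successes in n independent tries, each with probability p."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))

# Coin flips: seven or more heads in ten, seventy or more in a hundred
seven_heads_in_ten = prob_at_least(7, 10, 0.5)
seventy_heads_in_hundred = prob_at_least(70, 100, 0.5)

# A true .250 hitter batting .300 or better: 30+ hits in 100 AB, 150+ hits in 500 AB
hot_100_ab = prob_at_least(30, 100, 0.25)
hot_500_ab = prob_at_least(150, 500, 0.25)

print(f"{seven_heads_in_ten:.1%}")   # 17.2%
print(seventy_heads_in_hundred, hot_100_ab, hot_500_ab)
```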
baseball-reference.com. This site is the Mecca for baseball statistics, containing just about everything you could want, broken down in about a billion different ways. For our articles, we'll be mainly using as an example the 2010 Diamondbacks page, which gives you all the numbers for our players. We'll explain the numbers found there, tell you which ones are more important than others, and what values a good player should be putting up. Some will need more explanation than others; some will be a single line or less.
Still with me? Here we go.
BATTING NUMBERS
G = Number of games played. Doesn't matter whether you started, came in later or even if you got to bat. If you're announced, chalk one up. Just ask Robin Yount's brother Larry, a pitcher who injured himself while throwing warm-up tosses in his major-league debut, and never got to play again. He's still listed officially as having appeared in one game.
PA = Plate appearances. A player is credited with one of these, each time he completes a turn batting. Whether he walks, gets a hit, is hit by a pitch, makes an out: it doesn't matter. It's all counted as a plate appearance.
AB = At-bats. These are a restricted version of PAs - some PAs don't count as an at-bat. The most common cases which aren't counted are when the batter walks, is hit by a pitch, puts down a sacrifice bunt which advances a base-runner, or hits a sacrifice fly which brings one home. Those are all counted as PAs, but not ABs.
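Put as arithmetic, at-bats are what's left of plate appearances once those exceptions are taken out. A quick sketch, using a made-up season line and ignoring rare cases like catcher's interference:

```python
def at_bats(pa, bb, hbp, sh, sf):
    """At-bats: plate appearances minus walks, hit-by-pitches and sacrifices.

    Assumption: rare cases such as catcher's interference are ignored.
    """
    return pa - bb - hbp - sh - sf

# Hypothetical season line, not a real player
example_ab = at_bats(pa=600, bb=60, hbp=5, sh=2, sf=3)
print(example_ab)  # 530
```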
R = Runs. Every time a player crosses home-plate. They aren't much use as a measure of a player's own skill, because they are too dependent on other factors. Virtually, the only time a player will score a run on his own is with a homer; otherwise, he depends on something else happening, e.g. hit, wild pitch, etc. to bring him home. And all runs are "equal": if you get plunked and the next guy hits a homer, you get one run, exactly the same as if you tripled then stole home. Which hardly seems fair, does it?
H = Hits.
2B = Doubles.
3B = Triples.
HR = Home Runs. Largely self-explanatory. Hits are good, extra-base hits [doubles, triples and home-runs] are better. Doubles and triples can be an indicator of a player's speed, but you should use caution, as the park in which you play can heavily affect these, just as it affects home-run numbers. Chase Field, for example, saw 46 triples hit there last year. That's second only to Coors in the NL (50) and more than three times as many as the fifteen hit at Dodger Stadium.
RBI = Runs Batted In. You'll also see RBIs, though technically that's wrong, since RBI is already plural. Anyway, grammar Naziness aside, when a run is scored as the result of a player's action, they are credited with an RBI. If they get a hit, sacrifice fly, or walk which leads to their team scoring, they get an RBI. If they ground out and a run scores, they get credit for that too. You do not get an RBI if you ground into a double-play, or if the run scores as the result of a fielding error.
RBI, along with batting average and HR, form the Triple Crown, a very rare feat earned by a player when he leads his league in all three categories: it hasn't been done since Carl Yastrzemski of the 1967 Red Sox. However, the same goes for RBI as for Runs: if you come up with the bases empty, the only way to get an RBI is with a home-run. But if the bases are loaded, a bloop single could get you twice as many. A player's RBI number is largely determined by how good his team-mates are at getting on base in front of him. So while RBI are nice, exercise caution in using the number as proof of greatness.
SB = Stolen Bases.
CS = Caught Stealing. It's important to look at both numbers, because few things are worse than getting caught - you do the hard part, by getting on base, then give the opposition an out. The break-even point is very high: overall, you need to succeed about 70% of the time to have a positive impact [it varies from situation to situation: if you're down by one run with no outs in the ninth, it's less]. Washington's Nyjer Morgan stole 34 bases last year, tied for third in the NL. But he was caught 17 times, so succeeded only 67% of the time, and overall, probably hurt the Nats more than he helped. In general, SB must be more than twice CS, at the very least, to be considered positive.
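Here is the Morgan sum as a short sketch. The 70% break-even figure is the rough overall number quoted above, not a precise constant, and the names are just for this example.

```python
def sb_success_rate(sb, cs):
    """Share of steal attempts that succeeded."""
    return sb / (sb + cs)

BREAK_EVEN = 0.70  # rough overall break-even point; it varies by situation

# Nyjer Morgan, 2010: 34 steals, 17 times caught
morgan_rate = sb_success_rate(sb=34, cs=17)
morgan_helped = morgan_rate >= BREAK_EVEN

print(f"{morgan_rate:.1%}", morgan_helped)  # 66.7% False
```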
BB = Bases on Balls.
SO = Strike-outs. The "do strikeouts matter?" argument is an interesting one, too deep to get into here. The case against can be found in a previous exploration of the topic, but shoewizard also wrote about the counter-argument, suggesting too many strikeout-prone players on one team is dangerous. Overall in the 2010 National League and excluding pitchers, batters struck out a bit more than twice as often as they walked (2.09), with the D-backs' ratio 2.44. On an individual level, two qualifying NL batters (502 or more PAs) last year had more walks than K's: Albert Pujols and Jeff Keppinger; Ronny Cedeno was the sole man with four times as many K's as BB (106:23).
TB = Total Bases.
GDP = Double Plays Grounded Into.
HBP = Hit By Pitch.
SH = Sacrifice Hits.
SF = Sacrifice Flies.
IBB = Intentional Base on Balls. Just to tidy up, these are the minor categories listed, but you won't find them used very often in statistical argument.
BATTING STATISTICS
BA = Batting Average. Hits divided by at-bats. Simple, huh? The most well-known mark of hitting skill, and in the 2010 NL, for players with 150 PAs or more, it ranged from .336 (Carlos Gonzalez) to .181 (Garret Anderson), with the median Emilio Bonifacio's .261. If you hit .299 you'd be at the ninetieth percentile and .281 puts you in the top quartile. At the other end, .246 marks the bottom quartile, and .217 would put you in the bottom 10%. However, as a standalone figure, it doesn't tell you anything about the player's power, since singles and home-runs are counted the same, and it also omits walks entirely from the equation.
OBP = On-base Percentage. The formula here is a bit more complex: (H + BB + HBP) / (AB + BB + HBP + SF). The range is generally from .200 to .400 - only five NL batters were above .400 last year, led by Joey Votto's .424. The all-time high is the insane .609 by Barry Bonds in 2004. During 2010, the median was .327; Stephen Drew's .352 puts him in the top quartile, and .378 is the ninetieth percentile. While rare, it is possible for a hitter to have a lower OBP than BA. It takes a hitter who barely walks at all: sacrifice flies count against OBP but not BA, so his sacrifice flies need to outnumber his walks and hit-by-pitches several times over.
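The OBP formula translates directly into a few lines of code. Both stat lines in this sketch are invented: one ordinary season, and one tiny-sample oddity where a sacrifice fly and no walks drag OBP below BA.

```python
def batting_average(h, ab):
    """Hits divided by at-bats."""
    return h / ab

def obp(h, bb, hbp, ab, sf):
    """On-base percentage: (H + BB + HBP) / (AB + BB + HBP + SF)."""
    return (h + bb + hbp) / (ab + bb + hbp + sf)

# Hypothetical full season: OBP comfortably above BA, as usual
typical_ba = batting_average(h=150, ab=500)
typical_obp = obp(h=150, bb=60, hbp=5, ab=500, sf=5)

# Hypothetical oddity: no walks, one sacrifice fly, so OBP dips below BA
odd_ba = batting_average(h=3, ab=10)
odd_obp = obp(h=3, bb=0, hbp=0, ab=10, sf=1)

print(f"{typical_ba:.3f} {typical_obp:.3f}")
print(f"{odd_ba:.3f} {odd_obp:.3f}")
```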
SLG = Slugging Percentage. It's like batting average, but rather than all hits counting the same, a single counts as one, a double as two, while a triple and home-run are three and four. The total over a season is divided by the at-bats: if you like, it's the average number of bases a batter produces per at-bat. The average is currently round about .400; conveniently, .350 is the bottom quartile, .450 the top quartile, and .500 the ninetieth percentile. In career terms, Albert Pujols' .624 is the leader among active players, and trails only Babe Ruth, Lou Gehrig and Ted Williams all-time.
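As a sketch of the sum involved, with an invented stat line: singles aren't usually listed on their own, so they are worked out here as hits minus extra-base hits.

```python
def total_bases(h, doubles, triples, hr):
    """Singles count one base, doubles two, triples three, home runs four."""
    singles = h - doubles - triples - hr
    return singles + 2 * doubles + 3 * triples + 4 * hr

def slg(h, doubles, triples, hr, ab):
    """Slugging percentage: total bases per at-bat."""
    return total_bases(h, doubles, triples, hr) / ab

# Hypothetical season line, not a real player
example_tb = total_bases(h=150, doubles=30, triples=5, hr=20)
example_slg = slg(h=150, doubles=30, triples=5, hr=20, ab=500)

print(example_tb, f"{example_slg:.3f}")  # 250 0.500
```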
OPS = On-base plus Slugging Percentage. I trust I need not say how this is calculated. :-) This number became popular after its use in 1984 by John Thorn and Pete Palmer, in their (excellent) book, The Hidden Game of Baseball. It's important, because it is simple, but does a great job of combining all the important numbers - not just batting average, but walks and power - into one. League median last year was Cody Ross's .735. Justin Upton's .799 just missed out on the upper quartile, while Kelly Johnson's .865 put him in the top 10%. If you can crack .900, you're an All-Star; reach 1.000, and MVP beckons. Two did the latter in 2010: Votto and Pujols.
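For completeness, here is the whole chain from counting stats to OPS as a minimal sketch. The stat line is invented, and the function name is just for this example.

```python
def ops(h, doubles, triples, hr, bb, hbp, ab, sf):
    """On-base percentage plus slugging percentage."""
    on_base = (h + bb + hbp) / (ab + bb + hbp + sf)
    singles = h - doubles - triples - hr
    slugging = (singles + 2 * doubles + 3 * triples + 4 * hr) / ab
    return on_base + slugging

# Hypothetical season line, not a real player
example_ops = ops(h=150, doubles=30, triples=5, hr=20, bb=60, hbp=5, ab=500, sf=5)
print(f"{example_ops:.3f}")  # 0.877
```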
OPS+ = Adjusted OPS. Not all parks are equal. And not all seasons are equal. OPS+ is an effort to take those factors into account. I won't even get into the formula, because it doesn't matter. What you should know is that 100 is league average for the time, after adjusting for park factors [we'll get into those in part three, but for now, they measure the extent to which Chase is more hitter friendly than Petco, and so on]. Votto's 174 was best in the league; the ninetieth percentile came in at 131, and the top quartile at 114. The lower quartile started at 79, and the tenth percentile was a lowly 67. Every point above or below 100 is one-half percent better or worse than average.
EXTRA CREDIT
Understanding the above, and using them correctly, will get you through the vast majority of discussions, and make you look really, really smart. If you plan to get even deeper into the Matrix, here are some other terms you might hear.
wOBA = Weighted On-Base Average. One of the problems with OPS is that it values OBP and SLG the same, even though it has been shown that OBP correlates better with runs scored. wOBA attempts to address that by weighting each way of reaching base (walk, single, double and so on) by how much it is actually worth in runs. The scale is the same as for OBP, and you can find numbers on Fangraphs.com, which uses it a lot in its calculations of player value.
RC = Runs Created. This tries to measure how many runs a player created for his team - which is, after all, the point of the exercise, rather than walks, hits or any other stat in isolation. It was created by stats guru Bill James, and has been shown to be pretty good at predicting the actual runs a team will score - usually within 5%. Kelly Johnson led the Diamondbacks last year, with 109 RC; Drew and Chris Young were both in the nineties.
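To give a flavor of how it works, this sketch uses the basic version of James's formula: times on base, multiplied by total bases, divided by opportunities. The figures quoted on stat sites come from more elaborate versions that also account for things like steals and double plays, so this won't reproduce them exactly. The stat line is invented.

```python
def runs_created_basic(h, bb, tb, ab):
    """Basic Runs Created: (H + BB) * TB / (AB + BB).

    Assumption: this is the simplest form of the formula, not the
    more detailed versions used for published RC figures.
    """
    return (h + bb) * tb / (ab + bb)

# Hypothetical season line, not a real player
example_rc = runs_created_basic(h=150, bb=60, tb=250, ab=500)
print(example_rc)  # 93.75
```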
LD% = Line-drive percentage. The percentage of all balls put into play that are line-drives. Line-drives are good, because they are far more likely to become hits than fly-balls or ground-balls. League average last year was 19% - Mark Reynolds' struggles were largely because his number was dead last among qualifying NL batters, at only 13%. That's why his average was so low: he was hitting fly-balls and ground-balls instead. Whether he turns things around in Baltimore will likely depend on whether his LD% gets back to where it was.
BABIP = Batting Average on Balls in Play. We'll discuss this more in pitching, but the basic principle is that, after the ball leaves a hitter's bat and stays in the park, whether it becomes a hit or not is mostly chance, outside the hitter's control. .300 is league average; batters who hit a lot of line drives will see it higher than that, but if a hitter has a high BABIP, this can suggest he has been lucky, with balls finding holes and dropping into gaps. If so, then his numbers are likely to drop going forward. Conversely, a low BABIP can suggest he has been hitting balls at people, and similarly, that won't last forever.
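The usual way to calculate it is to take home runs and strikeouts out of the picture, since neither involves a ball in play, and put sacrifice flies back in. A minimal sketch with an invented stat line:

```python
def babip(h, hr, ab, so, sf):
    """Batting average on balls in play: (H - HR) / (AB - SO - HR + SF)."""
    return (h - hr) / (ab - so - hr + sf)

# Hypothetical season line, not a real player
example_babip = babip(h=150, hr=20, ab=500, so=100, sf=5)
print(f"{example_babip:.3f}")  # 0.338
```

A figure like that, well above the .300 league average, is the kind that should make you ask whether the hitter has been squaring the ball up or just getting the bounces.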
Next week, we'll look at pitching numbers.