When you create a scattergram, the most common next step in analysis and visualization is to superimpose a trend line on it. If you use Excel's default trend-line function, the line will be straight (the least squares line), and Excel can display an equation for the line. A straight line, however, may convey a distorted sense of the trend when the scattergram has some unusually high or low values, or anomalies in the middle of otherwise fairly consistent numbers. True, you can selectively delete the outliers, but doing so leaves you vulnerable to charges of data manipulation.

Alternatively, you might prefer that the plot of points reveal its pattern with a line that fits the points more snugly. Various programs can create such a line with what is called a smoothing function; the result may have bumps and squiggles in it. One advantage of a smooth fit over a more traditional linear fit is that smoothed lines are local. Outlier points affect only the parts of a smooth fit that lie near those points, whereas with a linear method outliers distort the entire straight line. The slope in the generated equation draws on all the data points at once.
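To make the point about straight lines concrete, here is a minimal sketch in Python of how a single outlier shifts a least squares line everywhere, not just near the odd point. The data are made up for illustration (ten points on an exact line, with one value then replaced by a hypothetical spike), and the function name `least_squares_line` is my own.

```python
def least_squares_line(xs, ys):
    """Return (slope, intercept) of the ordinary least squares line."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


# Made-up data: ten points lying exactly on y = 2x + 1.
xs = list(range(10))
clean = [2.0 * x + 1.0 for x in xs]

# Same data with one hypothetical outlier at the last point.
spiked = list(clean)
spiked[9] = 100.0

slope_clean, intercept_clean = least_squares_line(xs, clean)
slope_spiked, intercept_spiked = least_squares_line(xs, spiked)

print("clean: ", slope_clean, intercept_clean)
print("spiked:", slope_spiked, intercept_spiked)
```

Both the slope and the intercept change, so the fitted value at every x moves, including at the far end of the plot from the outlier.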

One particular method of smoothing a set of data points is known as “53h.” The name is simply a descriptive designation, one of many in a family of smoothers whose names spell out their steps. Let me explain this particular one. First, you replace each point with the median of the five points centered on it, which in itself, if plotted, could yield a considerable smoothing effect. Next, you continue the process by taking running medians of three over those medians. The “h” in 53h stands for “hanning,” a final pass that replaces each value with a weighted average of itself and its two neighbors, with weights of 1/4, 1/2 and 1/4. The smooth that results can then be subtracted from the original data and the process repeated on the residuals. But that is getting beyond what I think I understand.
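For readers who want to try this outside Excel, here is a rough Python sketch of a 53h smoother as I understand it: running medians of five, then running medians of three, then hanning. The function names are my own, the sample data are made up, and the handling of the end points (shrinking the window near the ends and copying the end values during hanning) is one simple convention I have assumed; other treatments exist.

```python
from statistics import median


def running_median(values, span):
    """Running median over an odd span.

    Near the ends the window shrinks symmetrically (an assumed convention).
    """
    half = span // 2
    n = len(values)
    out = []
    for i in range(n):
        k = min(half, i, n - 1 - i)  # largest symmetric half-width that fits
        out.append(median(values[i - k : i + k + 1]))
    return out


def hann(values):
    """Weighted average of each value and its neighbors (1/4, 1/2, 1/4).

    The two end values are copied unchanged (an assumed convention).
    """
    n = len(values)
    if n < 3:
        return list(values)
    out = [values[0]]
    for i in range(1, n - 1):
        out.append(0.25 * values[i - 1] + 0.5 * values[i] + 0.25 * values[i + 1])
    out.append(values[-1])
    return out


def smooth_53h(values):
    """Medians of 5, then medians of 3, then hanning."""
    return hann(running_median(running_median(values, 5), 3))


def smooth_53h_twice(values):
    """Smooth, then smooth the residuals and add them back to the smooth."""
    smooth = smooth_53h(values)
    rough = [v - s for v, s in zip(values, smooth)]
    return [s + r for s, r in zip(smooth, smooth_53h(rough))]


# Made-up data: a steady rise with one hypothetical spike in the middle.
data = [1, 2, 3, 4, 50, 6, 7, 8, 9, 10]
smoothed = smooth_53h(data)
print(smoothed)
```

In this made-up series the spike of 50 never appears in the smoothed values, because the median passes discard it before the averaging step ever sees it.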

Statisticians and visual data analysts use variations of the 53h smoothing method. They might take medians over spans of six points, and then four, but the outcome is similar. All the data you have available can be incorporated, but extreme or odd values will disturb only a small, local portion of your overall smooth curve. A straight line makes it easy to extrapolate to some point, but that may be misleading.

You do need a fair amount of data for smoothing to work. Say you plotted on one axis the number of lawyers in each law firm that you paid last year and on the other axis the amount you paid the firm. Many law departments would have more than 100 data points, and a smoothed function would make the whole data pattern visible. Try it yourself with the “scatter with smooth lines” function in the graphing section of Excel.

If you want a very sophisticated explanation and example of smoothing scatter-plot data, visit this website.
