**Editor’s note**

*I have strong views on statistical “significance”, confidence intervals, p-values, and all that, finding them harmful and worse. However, Kossin used these concepts, and a criticism of his methods on his terms has value. There’s much more to Kent’s critique than just that. It is a brilliant and devastating refutation of Kossin’s claims. Science at its best.*

In mid-May 2020, the Proceedings of the National Academy of Sciences (PNAS) published an article on hurricane trends over the past 39 years. That study, Kossin et al 2020, found “a clear shift toward greater intensity” of storms, manifested in the observational record as “increased probabilities of exceeding major hurricane intensity.” Both theory and climate models have long projected a shift to a higher proportion of high-intensity storms, but support for this theory has been difficult to detect in the observational record. Kossin et al 2020 provided a new homogenized dataset of hurricane wind intensity estimates for global tropical cyclones from 1979-2017. Based on this new dataset, the article claimed that a statistically significant shift to more intense storms has finally been detected in the observational record. By identifying “significant global trends in [tropical cyclone] intensity over the past four decades,” as projected by the models, the paper’s findings were said to “increase confidence in projections of increased [tropical cyclone] intensity under continued warming.”

The mainstream media trumpeted the results of this new study to a broader audience. The New York Times stated that “Climate Change Is Making Hurricanes Stronger.” Similarly, the Miami Herald urged south Floridians to “Expect stronger, deadlier, more frequent hurricanes in the years to come.” The Washington Post headline claimed that “The strongest, most dangerous hurricanes are now far more likely because of climate change.” As if the headline wasn’t scary enough, the Post’s article warned the public: “With powerful hurricanes on the increase, one can expect damage costs, in dollar terms and in potential loss of life, to skyrocket.” The lead author of the study, Dr. James Kossin, told the Post that “we have high confidence that there is a human fingerprint on these changes.” All of this was quite worrisome news, coming as it did just before hurricane season started.

The problem is that most of the claims of these newspapers simply weren’t true. Unfortunately, the Kossin et al 2020 paper contained a rather significant error: the paper reported erroneous values for the “metric of interest” on which the paper’s conclusions rest. Once corrected, the purported increase is no longer statistically significant. If that wasn’t bad enough, the media hyped up the results of the paper to reach conclusions that are far more worrisome than what the paper actually found.

SUMMARY OF THE PAPER

Kossin et al 2020 updated a homogenized hurricane intensity database based on satellite imagery. The standard “best track” dataset is based on observational records with known variations across timeframes and regions, so it is difficult to make long-term comparisons. The new homogenized dataset provides estimates for storm intensity based on satellite imagery. Although these are just estimates (placed in 5-knot bins), they are at least consistent and allow apples-to-apples comparisons across basins and timeframes for the 39 years for which data is available. The dataset contains hurricane wind speed estimates every 6 hours during the lifetime of each storm. These every-six-hour measurements of wind speed in knots are known as “fixes.” Essentially, the Kossin study split the 39-year record into two periods (1979 + 1981-1997 vs. 1998-2017) and counted the number of “fixes” in each period that are greater than or equal to 65 knots (indicating a hurricane) and the number of “fixes” that are greater than or equal to 100 knots (indicating a major hurricane of Category 3 or higher). The study then used these two counts to create a proportion: the percentage of all hurricane-strength “fixes” that are at major hurricane intensity. This proportion of 6-hour wind speed measurements is the study’s “metric of interest.”
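The construction of this metric can be sketched in a few lines of code. The wind speed values below are invented for illustration only; they are not the paper’s actual data:

```python
# Sketch of the "metric of interest": the share of hurricane-strength
# fixes (>= 65 kt) that reach major-hurricane strength (>= 100 kt).

def major_share(fixes_kt):
    """Proportion of hurricane-force fixes at major-hurricane intensity."""
    hurricane = [v for v in fixes_kt if v >= 65]   # Category 1 or higher
    major = [v for v in hurricane if v >= 100]     # Category 3 or higher
    return len(major) / len(hurricane)

# Hypothetical 6-hourly wind speed estimates (knots), in 5-kt bins:
early = [65, 70, 85, 100, 110, 60, 95, 105]
late  = [70, 90, 100, 115, 120, 65, 100, 55]

print(major_share(early))   # 3 of 7 hurricane-force fixes are major
print(major_share(late))    # 4 of 7 hurricane-force fixes are major
```

Note that fixes below 65 kt drop out of both the numerator and the denominator, which is why the metric is a relative share rather than a count of storms.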

The bottom-line finding of the study is that the proportion of 100 kt or greater intensity “fixes” increased from 27% (CI=[25%, 28%]) of all hurricane-force fixes in the early period to 31% (CI=[29%, 32%]) in the later period. Because the confidence intervals don’t overlap, the increase in the proportion of major hurricane intensity wind speeds is statistically significant. Hence, the paper declared that the long-expected shift to more intense storms was finally observed in the record. As will be seen, it is critical to keep in mind what this “metric of interest” really says. It is strictly about the proportionate share of all 6-hour intensity measurements. It is a relative measurement, not a measure of absolute numbers.

IRREPRODUCIBLE RESULTS

Unfortunately, the results reported in the paper are inconsistent with the underlying dataset that was supplied with the paper’s supplemental information. To derive the “metric of interest,” the paper reported a total of 2362 major-hurricane-force wind speed measurements out of 8848 total measurements of hurricane wind speed for the period 1979 + 1981-1997. These numbers are incorrect. The correct counts, which are easily derived from the supporting data on the PNAS website, are 3202 major-hurricane-force wind speeds and 9420 total hurricane-level wind speeds. The counts are similarly erroneous for the later period 1998-2017. Since the counts were wrong, the proportions used in the study were also wrong, as was the change in the proportion between the early and later period. The original paper reported that major storm wind speeds accounted for 27% of all hurricane wind speeds in the early period but rose to 31% in the later period, or a 15% increase. In fact, the proportion of major hurricane wind speeds only increased 10%, rising from 34% in the early period to 37% in the later period. I notified Dr. Kossin of these errors before Memorial Day, and he replied shortly thereafter confirming the errors and indicating that an erratum would be issued. In November, a Correction was published in PNAS acknowledging the original paper contained calculation errors due to a shift in the research team’s code. After correcting for that error, the Correction reports that the proportion of major hurricane wind speeds had in fact only increased by 10%, rather than the 15% originally reported.
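These figures are straightforward to verify from the counts quoted above. A quick arithmetic check, using the counts reported in the paper and those derived from the supplemental data (the corrected late-period proportion of 0.3725 is taken from the Correction):

```python
# Proportions as originally reported in Kossin et al 2020:
p_early_orig = 2362 / 8848   # ~0.267 -> 27%
p_late_orig  = 2842 / 9275   # ~0.306 -> 31%

# Proportions from the corrected counts:
p_early = 3202 / 9420        # ~0.340 -> 34%
p_late  = 0.3725             # -> 37%, per the PNAS Correction

# Relative increase in the proportion between the two periods:
print(round((p_late_orig / p_early_orig - 1) * 100))   # 15 (as originally reported)
print(round((p_late / p_early - 1) * 100))             # 10 (after correction)
```

The same counts thus reproduce both the 15% increase claimed in the original paper and the smaller 10% increase that survives the correction.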

*Table A. Key metrics (including confidence intervals) from the original paper and the Correction published in PNAS, as well as metrics recalculated from the dataset.*

The Kossin et al Correction revised several sentences in the paper and completely revised Table 1, which contains the detailed numeric results of the study globally and for each hurricane basin individually. Unfortunately, the revised Table 1 still contains errors. Kossin et al failed to change the global counts for the late period, so the erroneous values from the original paper (N_{tot} = 9275 and N_{maj} = 2842) still appear in the Correction’s version of the table. It is clear that this error was a minor oversight, because the late-period proportion, which is the “metric of interest,” was updated to the correct value (P_{maj} = 0.3725), and each of the individual basins was properly updated (as will be seen below in Table C).

STATISTICAL SIGNIFICANCE, OR NOT?

More important is the claim in Kossin et al’s Correction that “none of the errors alter any of the key results or messages of the manuscript.” In other words, the Kossin et al Correction continues to maintain that the difference in the proportion of major hurricane wind speeds remains statistically significant. Statistical significance is of primary importance to the value of the paper, since Kossin et al is the first paper to claim to have found a significant increase in hurricane intensity in the observational record. Unfortunately, there is reason to suspect that the results are not actually significant. As noted previously, the change in the proportion of global major hurricane winds was erroneously reported in the original article as increasing by 15%, which just barely crossed the threshold for statistical significance (the upper bound of the early period was 28% and the lower bound of the late period was 29%). With the correct calculations, the real difference between periods is substantially smaller (only 10%), which implies that the statistical significance of the change would likely decline. Since the original confidence intervals were contiguous, it would seem doubtful that sufficient wiggle room would exist to continue to claim statistical significance.

Kossin et al attempted to supply some wiggle room by adding two extra decimal places to their confidence intervals shown in Table 1 of the Correction. The Correction shows the upper bound of the early period as 35.55% and the lower bound of the late period as 35.59%. If the conventions of the original paper had been followed, these both would round to 36%, showing overlapping confidence intervals and rejecting statistical significance. By adding two additional decimal places to their calculations in the Correction, Kossin et al can show a minuscule gap of 4/10,000^{ths} between the confidence intervals in order to claim statistical significance.
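The size of that gap, and the effect of the paper’s original whole-percent rounding convention, can be checked with simple arithmetic on the bounds quoted from the Correction:

```python
upper_early = 35.55   # upper bound of the early period, per the Correction (%)
lower_late  = 35.59   # lower bound of the late period, per the Correction (%)

# At the original paper's whole-percent precision, both bounds round to 36%,
# so at that precision the intervals touch and no gap is visible:
print(round(upper_early), round(lower_late))    # 36 36

# The claimed gap is only 0.04 percentage points (4/10,000ths):
print(round(lower_late - upper_early, 2))       # 0.04
```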

However, the confidence intervals shown in the Correction appear to be erroneous. The original paper describes the method for calculating the 95% confidence intervals. The standard process for calculating confidence intervals of proportions is followed with one exception. The full sample size of hurricane wind speed measurements cannot be used in the calculation of the standard error because the data points are not independent. According to the “Methods,” the degrees of freedom “are reduced by a factor of 3, which assumes a decorrelation of 18 hours.” In other words, testing for statistical significance uses 1/3 of the population size when deriving the standard error.
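Under this methodology, the confidence interval for a period can be reproduced directly from its counts. A minimal sketch, using the corrected early-period counts quoted earlier (3202 major-hurricane fixes out of 9420 hurricane-force fixes) and a normal-approximation interval with the effective sample size reduced to n/3:

```python
import math

def proportion_ci(successes, n, z=1.96, dof_factor=3):
    """95% CI for a proportion, with the effective sample size reduced
    by dof_factor to account for serial correlation between fixes
    (an assumed 18-hour decorrelation time -> n/3)."""
    p = successes / n
    n_eff = n / dof_factor
    se = math.sqrt(p * (1 - p) / n_eff)
    return p - z * se, p + z * se

# Corrected early-period counts from the supplemental dataset:
lo, hi = proportion_ci(3202, 9420)
print(round(lo * 100, 2), round(hi * 100, 2))   # 32.33 35.65
```

With these counts, the upper bound of the early period comes out near 35.65%, not the 35.55% shown in the Correction’s Table 1.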

When I calculate the confidence intervals following this methodology, the results are different from what is shown in the Correction. The actual confidence intervals of the two periods overlap. The upper bound of the early period is 35.65% and the lower bound of the late period is 35.49%, which means that the confidence intervals overlap by 16/10,000^{ths}. Overlapping confidence intervals, of course, means that the difference in the proportions is not statistically significant.