Front Page Clues

They say the best place to hide a dead body is on page two of the Google search results. I’d argue that a similar rule applies to reading the news, especially online. If a story is not on the landing page of whatever news site I’m looking at, chances are I’m not gonna find it. All this is to say: news outlets wield considerable power to direct our attention where they want it simply by virtue of how they organize content on their sites.

During presidential elections, the media is often criticized for giving priority to the political horse race between dueling candidates, preoccupying us with pageantry over policy. But to what extent is this true? And if it is true, which specific policy issues suffer during election cycles? Do some suffer more than others? What are we missing out on because we are too busy keeping up with the horse race instead?

If you want to go straight to the answers to these questions (and some other interesting stuff), skip down to the Findings section of the post. For those interested in the technical details, the next two sections are for you.

Data

Lucky for us (or maybe just me), the New York Times generously makes a ton of its data available online for free, easily retrievable via calls to a REST API (specifically, their Archive API). Just a few dozen calls and I was in business. This amazing resource not only has information going back to 1851 (!!), it also includes keywords from each article as part of its metadata. Even better, since 2006, they have ranked the keywords in each article by their importance. This means that for any article that is keyword-ranked, you can easily extract its main topic—whatever person, place, or subject it might be.
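To make the retrieval step concrete, here is a minimal sketch of how one might build the monthly Archive API URLs and pull the top-ranked keyword out of an article record. The field names (`keywords`, `rank`, `value`, `print_page`, `news_desk`) reflect my understanding of the Archive API's response format, the example record is made up, and I'm assuming ranks start at 1. It makes no network calls.

```python
from typing import Optional

# Archive API endpoint: one call per (year, month); the API key is passed as a query parameter.
ARCHIVE_URL = "https://api.nytimes.com/svc/archive/v1/{year}/{month}.json"


def archive_url(year: int, month: int) -> str:
    """Build the Archive API URL for a single month of articles."""
    if not 1 <= month <= 12:
        raise ValueError("month must be in 1..12")
    return ARCHIVE_URL.format(year=year, month=month)


def main_topic(doc: dict) -> Optional[str]:
    """Return the value of the top-ranked keyword, or None if nothing is ranked."""
    ranked = [k for k in doc.get("keywords", []) if k.get("rank")]
    if not ranked:
        return None
    # Assumption: a lower rank number means a more important keyword.
    return min(ranked, key=lambda k: int(k["rank"]))["value"]


# Hypothetical article record, shaped like one entry of response["docs"].
example_doc = {
    "print_page": "1",
    "news_desk": "National",
    "keywords": [
        {"name": "glocations", "value": "United States", "rank": 2},
        {"name": "subject", "value": "Health Insurance and Managed Care", "rank": 1},
    ],
}

print(archive_url(2016, 10))
print(main_topic(example_doc))
```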

Having ranked keywords makes this analysis much easier. For one thing, we don’t have to sift through words from mountains of articles in order to surmise what each article is about using fuzzy or inexact NLP methods. And since they’ve been ranking keywords since 2006, this gives us three presidential elections to include as part of our analysis (2008, 2012, and 2016).

The other crucial dimension included in the NYT article metadata is the print page. Personally, I don’t ever read the NYT on paper anymore (or any newspaper, for that matter—they’re just too unwieldy), so you might argue that the print page is irrelevant. Possibly, but unfortunately we don’t have data about placement on the NYT’s website. And moreover, I would argue that the print page is a good proxy for this. It gets at the essence of what we’re trying to measure, which is the importance NYT editors place on a particular topic over others.

Model

$\text{logit}(\pi_t) = \log\left(\frac{\pi_t}{1-\pi_t}\right) = \alpha + \sum_{k=1}^{K} \beta_{k} \cdot Desk_k + \beta \cdot is\_election$

A logistic regression model underpins the analysis here. The log-odds that a topic $\textit{t}$ will appear on the front page of the NYT is modeled as a function of the other articles appearing on the front page (the $Desk$ variables, more on those below), as well as a dummy variable indicating whether or not the paper was published during an election cycle.
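To see how the equation turns into a probability, here is a small sketch that evaluates the linear predictor and maps it back through the inverse logit. The coefficients and desk counts are invented for illustration; they are not fitted values from this analysis.

```python
import math


def inv_logit(eta: float) -> float:
    """Map log-odds back to a probability."""
    return 1.0 / (1.0 + math.exp(-eta))


def front_page_probability(alpha, desk_betas, desk_counts, beta_election, is_election):
    """Evaluate alpha + sum_k beta_k * Desk_k + beta * is_election, then invert the logit."""
    eta = alpha + sum(b * desk_counts.get(desk, 0) for desk, b in desk_betas.items())
    eta += beta_election * int(is_election)
    return inv_logit(eta)


# Hypothetical coefficients, for illustration only.
alpha = -1.0
desk_betas = {"National": -0.3, "Foreign": -0.2}
beta_election = -0.8
counts = {"National": 2, "Foreign": 1}  # other front-page articles, by desk

p_off = front_page_probability(alpha, desk_betas, counts, beta_election, False)
p_on = front_page_probability(alpha, desk_betas, counts, beta_election, True)
print(round(p_off, 3), round(p_on, 3))
```

With a negative election coefficient, the second probability comes out lower than the first, which is the kind of effect the Findings section looks for.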

Modeling the other articles on the front page is essential since they have obvious influence over whether topic $\textit{t}$ will make the front page on a given day. But in modeling these other articles, a choice is made to abstract from the topics of the articles to the news desk from which they originated. Using the topics themselves unfortunately leads to two problems: sparsity and singularity. Singularity is a problem that arises when your data has too many variables and too few observations. Fortunately, there are statistical methods to overcome this issue, namely penalized regression. Penalized regression is often applied to machine learning problems, but recent developments in statistics have extended the methodology of significance testing to penalized models like ridge regression. This is great since we are actually concerned with interpreting our model rather than just making predictions, which is the more common aim in machine learning applications.

Ultimately though, penalized methods do not overcome the sparsity problem. Simply put, there are too many other topics that might appear on the front page, and each appears too infrequently, to get a good read on the situation. Therefore, as an alternative, we aggregate the other articles on the front page according to the news desk they came from (things like Foreign, Style, Arts & Culture, etc.). Doing so allows our model to be readily interpretable while retaining information about the kinds of articles that might be crowding out topic $\textit{t}$.
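The aggregation step might look something like the sketch below: for each edition, record whether the topic made page one and count the other front-page articles by news desk. The record layout and the `build_rows` helper are hypothetical (already-cleaned data with an integer `print_page`), and excluding the topic's own articles from the desk counts is my reading of "other articles".

```python
from collections import Counter


def build_rows(articles, topic):
    """Return {date: {"on_front": 0 or 1, "desks": Counter of other front-page articles}}."""
    rows = {}
    for a in articles:
        row = rows.setdefault(a["date"], {"on_front": 0, "desks": Counter()})
        if a["print_page"] != 1:
            continue  # only front-page articles enter the model
        if a["topic"] == topic:
            row["on_front"] = 1  # the response variable
        else:
            row["desks"][a["news_desk"]] += 1  # the Desk_k predictors
    return rows


# Hypothetical, already-cleaned article records.
articles = [
    {"date": "2016-10-01", "print_page": 1, "news_desk": "National", "topic": "Health Insurance"},
    {"date": "2016-10-01", "print_page": 1, "news_desk": "Foreign", "topic": "Terrorism"},
    {"date": "2016-10-01", "print_page": 1, "news_desk": "Foreign", "topic": "Syria"},
    {"date": "2016-10-01", "print_page": 12, "news_desk": "Sports", "topic": "Baseball"},
    {"date": "2016-10-02", "print_page": 1, "news_desk": "Politics", "topic": "Presidential Election of 2016"},
]

rows = build_rows(articles, "Health Insurance")
for day, row in sorted(rows.items()):
    print(day, row["on_front"], dict(row["desks"]))
```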

The $is\_election$ variable is a dummy variable indicating whether or not the paper was published in an election season. This is determined via a critical threshold illustrated by the red line in the graph below. The same threshold was applied across all three elections.

In some specifications, the $is\_election$ variable might be broken into separate indicators, one for each election. In other specifications, these indicators might be interacted with one or several news desk variables—though only when the interactions add explanatory value to the overall model as determined by an analysis of deviance.
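For the simplest case, where two nested models differ by a single term, the analysis of deviance boils down to comparing the drop in deviance against a chi-square distribution with one degree of freedom. Here is a sketch of that one-parameter case only; the deviance values are made up, and the `lr_test_1df` helper is my own illustration rather than the code behind this post.

```python
import math


def lr_test_1df(deviance_reduced: float, deviance_full: float):
    """Likelihood-ratio test for one added parameter.

    Returns (statistic, p_value). For a chi-square variable with 1 degree of
    freedom, P(X > x) equals erfc(sqrt(x / 2)).
    """
    stat = deviance_reduced - deviance_full
    if stat < 0:
        raise ValueError("the full model should not have higher deviance than the reduced one")
    return stat, math.erfc(math.sqrt(stat / 2.0))


# Hypothetical deviances for a model without and with one interaction term.
stat, p_value = lr_test_1df(412.3, 406.1)
print(round(stat, 2), round(p_value, 4))
```

A small p-value suggests the extra term earns its place in the model; a large one suggests leaving it out.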

Two other modeling notes. First, for some topics, the model might suffer from quasi or complete separation. This occurs when, for example, all instances of topic $\textit{t}$ appearing on page one occur when there are also fewer than two Sports desk articles appearing on page one. Separation can mess up logistic regression coefficient estimates, but fortunately, a guy named Firth (not Colin, le sigh) came up with a clever workaround, which is known as Firth regression. In cases where separation is an issue, I switch out the standard logit model for Firth’s alternative. This is easily done using R’s logistf package, and reinforces why I favor R over Python when it comes to doing serious stats.
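As a rough illustration of what separation looks like in the data, the sketch below checks a single predictor against the outcome. It is a simplified diagnostic under my own assumptions: it only looks at one variable at a time and won't catch separation caused by combinations of variables. The `separation_status` helper and the sample counts are hypothetical.

```python
def separation_status(x, y):
    """Classify one predictor as 'complete', 'quasi', or 'none' with respect to a 0/1 outcome."""
    x1 = [xi for xi, yi in zip(x, y) if yi == 1]
    x0 = [xi for xi, yi in zip(x, y) if yi == 0]
    if not x1 or not x0:
        raise ValueError("need at least one observation of each outcome")
    if min(x) == max(x):
        return "none"  # a constant predictor cannot separate anything
    # Complete: a threshold splits the outcomes with no overlap at all.
    if max(x1) < min(x0) or max(x0) < min(x1):
        return "complete"
    # Quasi-complete: the outcomes overlap only at the threshold value itself.
    if max(x1) <= min(x0) or max(x0) <= min(x1):
        return "quasi"
    return "none"


# Hypothetical data: Sports desk articles on page one vs. whether topic t made page one.
sports = [0, 1, 1, 2, 3, 2]
on_front = [1, 1, 1, 0, 0, 0]
print(separation_status(sports, on_front))
```

In this made-up sample, the topic only ever makes page one when there are fewer than two Sports articles and never otherwise, so the check reports complete separation.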

Second, it should be pointed out that our model does run somewhat afoul of one of the basic assumptions of logistic regression—namely, independence. In regression models, it is regularly assumed that the observations are independent of (i.e. don’t influence) each other. That is probably not true in this case, since news cycles can stretch over the span of several newspaper editions. And whether a story makes front page news is likely influenced by whether it was on the front page the day before.

Model-wise, this is a tough nut to crack since the data is not steadily periodic—as is the case with regular time series data. It might be one, two, or sixty days between appearances of a given topic. In the absence of a completely different approach, I test the robustness of my findings by including an additional variable in my specification—a dummy indicating whether or not topic $\textit{t}$ appeared on the front page the day before.
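The day-before dummy is simple to construct even though appearances are irregularly spaced. Here is a minimal sketch, with made-up dates and a hypothetical `appeared_day_before` helper:

```python
from datetime import date, timedelta


def appeared_day_before(front_page_dates, edition_dates):
    """For each edition date, return 1 if the topic was on the front page the previous day."""
    on_front = set(front_page_dates)
    one_day = timedelta(days=1)
    return {d: int((d - one_day) in on_front) for d in edition_dates}


# Hypothetical editions and front-page appearances for one topic.
editions = [date(2016, 10, d) for d in range(1, 6)]
topic_front_pages = [date(2016, 10, 1), date(2016, 10, 2), date(2016, 10, 4)]

lag = appeared_day_before(topic_front_pages, editions)
print([lag[d] for d in editions])  # [0, 1, 1, 0, 1]
```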

Findings

For this post, I focused on several topics I believe are consistently relevant to national debate, but which I suspected might get less attention during a presidential election cycle. It appears the 2016 election cycle was particularly rough on healthcare coverage. The model finds a statistically significant effect $(\beta = -4.519; p = 0.003)$, which means that for the average newspaper, the probability that a healthcare article made the front page dropped by roughly 60% during the 2016 election season, from $0.181$ to $0.071$. This calculation is made by comparing the predicted values with and without the 2016 indicator activated, while holding all other variables fixed at their average levels during the 2016 election season.
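The percentage drop quoted above is a relative change between the two predicted probabilities. A quick check of the arithmetic, using only the two numbers reported in this post:

```python
def relative_change(p_before: float, p_after: float) -> float:
    """Relative change from p_before to p_after (negative means a drop)."""
    return (p_after - p_before) / p_before


# Predicted probabilities without and with the 2016 indicator, as reported above.
change = relative_change(0.181, 0.071)
print(round(100 * change, 1))  # -60.8
```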

Another interesting finding is the significant coefficient $(p = 0.045)$ found on the interaction term between the 2016 election and articles from the NYT’s National desk, which is actually positive $(\beta = 0.244)$. Given that the National desk is one of the top-five story-generating news desks at the New York Times, you would think that more National stories would come at the expense of just about any other story. And this is indeed the case outside of the 2016 election season, where the probability that healthcare will make the front page drops 45% when an additional National article is run on the front page of the average newspaper. During the 2016 election season, however, the probability actually increases by 25%. These findings were robust to whether or not healthcare was front page news the day before.

The flip on the effect of National coverage here is curious, and raises the question as to why it might be happening. Perhaps NYT editors had periodic misgivings about the adequacy of their National coverage during the 2016 election and decided to make a few big pushes to give more attention to domestic issues including healthcare. Finding the answer requires more digging. In the end though, even if healthcare coverage was buoyed by other National desk articles, it still suffered overall during the 2016 election.

The other topic strongly associated with election cycles is gun control $(\beta = -3.822; p = 0.007)$. Articles about gun control are 33% less likely to be run on the front page of the average newspaper during an election cycle. One thing that occurred to me about gun control, however, is that it generally receives major coverage boosts in the wake of mass shootings. It’s possible that the association here is being driven by a dearth of mass shootings during presidential elections, but I haven’t looked more closely to see whether a drop-off in mass shootings during election cycles actually exists.

Coverage about the U.S. economy is not significantly impacted by election cycles, which ran against my expectations. However, coverage about the economy was positively associated with coverage about sports, which raises yet more interesting questions. For example, does our attention naturally turn to sports when the economic going is good?

Unsurprisingly, elections don’t make a difference to coverage about terrorism. However, when covering stories about terrorism in foreign countries, other articles from the Foreign desk significantly influence whether the story will make the front page cut $(\beta = -1.268; p = 8.49 \times 10^{-12})$. Starting with zero Foreign stories on page one, just one other Foreign article will lower the chances that an article about foreign terrorism will appear on page one by 40%. In contrast, no news desk has any systematic influence on whether stories about domestic terrorism make it onto page one.

Finally, while elections don’t make a difference to front page coverage about police brutality and misconduct, interestingly, articles from the NYT Culture desk do. There is a significant and negative effect $(\beta = -0.364; p = 0.033)$, which, for the average newspaper, means a roughly 17% drop in the probability that a police misconduct article will make the front page with an additional Culture desk article present. Not to knock the Culture desk or nothing, but this prioritization strikes me as somewhat problematic.

In closing, while I have managed to unearth several insights in this blog post, many more may be surfaced using this rich data source from the New York Times. Even if some of these findings raise questions about how the NYT does its job, it is a testament to the paper as an institution that they are willing to open themselves to meta-analyses like this. Such transparency enables an important critical discussion about the way we consume our news. More informed debate—backed by hard numbers—can hopefully serve the public good in an era when facts in the media are often under attack.
