The Optimistic Leftist: The Truth About the Exit Polls and Possible Alternatives

By now, we know beyond a shadow of a doubt that exit polls provide very unreliable information about the composition of the electorate and the voting preferences of different demographic groups. The exit poll folks say they're cleaning up their act, but I'm not holding my breath.

Yair Ghitza, chief data scientist at Catalist, runs down some of the most notorious problems and describes how his firm hopes to provide an alternative to the exits (along with AP/NORC, which is implementing another new system called VoteCast). From his lengthy but well-worth-reading article on Medium:

"We’ve recently developed estimates for an “exit-poll-style” demographic analysis of past elections since 2008, and we’re planning on releasing estimates for this midterm cycle shortly after the 2018 election. Data will be available for select states and congressional districts, possibly including a national estimate....

Like...other datasets....the voter file isn’t perfect. But we think it is an ideal starting point for understanding what happened in past elections. The main advantage of the voter file is that we start with a rich and detailed view of the composition of the electorate:

- The voter file tells us exactly who voted. We don’t rely on self-reported survey data on turnout, which is notoriously inaccurate. We don’t rely on precinct sampling, and we don’t have to recruit survey respondents, which may or may not be representative of the true voting population. We start with the full list of everybody who voted, as was officially recorded by each Secretary of State.

- Some demographics like age and gender have close to full coverage for voters, because these are almost universally used in voter registration forms. For these two, we’re confident that voter file estimates are the most accurate available[3]. For other demographics like race and education, self-reported data isn’t available in all geographies across the country. The voter file does have a lot of other information which is helpful for estimating these demographics. Our process starts with large-scale machine learning models to identify the probability that each voter has a certain racial background, for example, based on whether (s)he has a particularly ethnic name, neighborhood, and other characteristics. Because we have precise geocodes for the vast majority of voters in our database, we can compare and calibrate these estimates to census and other outside data sources to remove inaccuracies that are introduced using standard modeling methods. Critically, we build the models in such a way that even if we can’t precisely identify all demographics at the individual level, our estimates are accurate at the aggregate population level. We can’t claim that our estimates are 100% accurate, but we do feel that the effort spent here makes these into very high quality, defensible estimates when compared to anything else that we’ve seen.

- Lastly, we’ve been collecting voter files in a consistent format since at least 2008 in every state in the country. This gives us a relatively long history of data, from which to build our models and calibrate our methods.

This gives us a great foundation for understanding the composition of the electorate. To understand candidate choice, we combine the voter file data with survey data using a statistical technique called Multilevel Regression and Poststratification (MRP). MRP combines flexible statistical models with large population datasets to provide more reliable estimates for small subgroups where standard survey methods don’t have enough sample size to work properly. While MRP is a general method that has shown promising results and is becoming increasingly popular (with some skeptics), we think it is ideally suited to be used with voter files, due to the large amount of detailed data there. We developed some of the specifics about this in an academic context starting in 2013, and have since improved on those methods in various ways for this project.

How does our data compare to some of the more familiar public sources, particularly the dominant exit poll?....Some of the consistent trends are:

- Our data shows an electorate that is consistently less college educated (34% in 2016, compared to 50% for the exit poll), older (25% aged 65+, compared to 16%), whiter (74%, compared to 71%), and often has more women (53–55% across all years, sometimes it is lower for exits). These differences extend back to 2006, and they span both modeled data and data reported directly from the Secretary of State (age and gender).

- At the same time, our vote margins are often more in favor of Democrats than the exits, within each group. Some of the largest and most consistent differences are among white non-college voters (an average difference of 8 points in margin) and white women (6 points).

- Lining these data up with other data sources can reveal which estimates are plausible. For example, the exit poll shows that half of the 138 million voters in 2016 had a college degree, implying 69 million college-educated voters. The Census American Community Survey for 2016 shows there are only 66 million college-educated citizens across the country, implying a turnout rate of 105%."
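
The plausibility check in that last bullet is simple enough to run yourself. Here is a minimal sketch in Python that uses only the figures quoted above (138 million voters, the exit poll's 50% college share, Catalist's 34% share, and the ACS count of 66 million college-educated citizens). The function name `implied_turnout` is my own, and applying the same arithmetic to Catalist's 34% is my back-of-the-envelope extension, not a number from Ghitza's article.

```python
# Figures quoted in the excerpt above (2016), in millions where applicable.
TOTAL_VOTERS = 138.0            # total 2016 voters
ACS_COLLEGE_CITIZENS = 66.0     # college-educated citizens per the ACS
EXIT_POLL_COLLEGE_SHARE = 0.50  # exit poll: share of voters with a degree
CATALIST_COLLEGE_SHARE = 0.34   # Catalist: share of voters with a degree


def implied_turnout(total_voters, group_share, group_population):
    """Turnout rate the group would need for the reported share to hold."""
    implied_voters = total_voters * group_share
    return implied_voters / group_population


exit_rate = implied_turnout(
    TOTAL_VOTERS, EXIT_POLL_COLLEGE_SHARE, ACS_COLLEGE_CITIZENS
)
# My own extension: the same check applied to Catalist's estimate.
catalist_rate = implied_turnout(
    TOTAL_VOTERS, CATALIST_COLLEGE_SHARE, ACS_COLLEGE_CITIZENS
)

print(f"Exit poll implies {exit_rate:.0%} turnout among college graduates")
print(f"Catalist implies {catalist_rate:.0%} turnout among college graduates")
```

The exit poll's ratio comes out to 69/66, or roughly 1.045, which rounds to the impossible 105% in the quote. The 34% share implies a turnout rate below 100%, so it at least passes this basic sanity test.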

Clearly there is a need for better data sources than the exits. I believe Catalist is on track to be one such source.
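
For readers curious about what the "poststratification" half of MRP amounts to, here is a toy sketch in Python. It is not Catalist's method. The multilevel regression step is skipped entirely and replaced with made-up support estimates per group, and the group labels and voter counts are hypothetical, chosen only to make the arithmetic easy to follow. The point is just that once you know how many voters fall in each group (which is what the voter file supplies), you weight each group's estimated preference by its actual size.

```python
# Hypothetical cells: (label, voters in the cell, modeled Democratic share).
# In real MRP the shares would come from a multilevel regression fit to
# survey data; here they are invented purely for illustration.
cells = [
    ("group A", 600, 0.40),
    ("group B", 300, 0.60),
    ("group C", 100, 0.90),
]


def composition(cells):
    """Share of the electorate in each cell, from the voter counts."""
    total = sum(count for _, count, _ in cells)
    return {label: count / total for label, count, _ in cells}


def poststratify(cells):
    """Weight each cell's estimated support by its share of actual voters."""
    total = sum(count for _, count, _ in cells)
    return sum(count * share for _, count, share in cells) / total


# A naive average treats every cell as equally large.
naive = sum(share for _, _, share in cells) / len(cells)

print(composition(cells))
print(f"naive average of cells: {naive:.3f}")
print(f"poststratified estimate: {poststratify(cells):.3f}")
```

In this toy example the naive average overstates support because the smallest group is the most lopsided. Weighting by cell size corrects that, which is why getting the composition of the electorate right matters so much for the vote-choice numbers.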