A report was published in March 2016 that laid the blame on an imbalance in the polling samples: too many younger, left-leaning people who were less likely to vote on the day; not enough older or busier people, both groups statistically more inclined to vote Conservative. Another panel of experts raised the possibility of ‘herding’ – none of the companies wanting to provide outlier results and so all consciously or unconsciously crafting their surveys and samples to produce similar outcomes. They stressed, though, that they were not suggesting malpractice.
The industry took a battering and much soul-searching has gone on since.
Could technology solve all their problems? We explore how research and development (R&D) offers pollsters opportunities to change the way they collect and interpret information. Big data, AI, and quantum and classical computing are important themes but, as 2015 showed, there is a lot of ground to make up. Let’s explore those next generation polling techniques.
A warning from history
More of an amusing anecdote than a warning really. And when we say ‘history’ we mean two years ago. But if you think that the silver bullet is just to pull a load of data out of Google or Twitter and crunch the numbers, consider this:
In the build-up to the Scottish Independence Referendum, the Bank of England was concerned that there might be a run on the Scottish banks if independence was chosen. They undertook a project to examine tweets for their value in measuring real-time sentiment. A tool was constructed that filtered relevant keywords, two of which were ‘run’ and ‘RBS’. Three days before the referendum, the warning signals started flashing. There had been a huge spike in ‘run’ and ‘RBS’. Were their worst fears being realised?
…Actually no. It was a conversation about the American football team the Minnesota Vikings and their running backs – ‘RBs’ for short!
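The pitfall is easy to reproduce. A minimal sketch of the kind of context-free keyword filter described above (the tweets here are invented for illustration) shows how case-insensitive matching happily conflates ‘RBs’ with ‘RBS’:

```python
# A naive keyword filter of the kind described above: flag any tweet that
# mentions both 'run' and 'rbs', case-insensitively, with no sense of context.
KEYWORDS = {"run", "rbs"}

def looks_like_bank_run(tweet: str) -> bool:
    # Lower-case every word and strip trailing punctuation before matching.
    words = {w.strip(".,!?").lower() for w in tweet.split()}
    return KEYWORDS.issubset(words)

panic    = "Rumours of a run on RBS ahead of the vote"
football = "Huge run there by the RBs, Vikings looking good"

print(looks_like_bank_run(panic))     # True - the signal we wanted
print(looks_like_bank_run(football))  # True - a false positive
```

Both tweets trip the alarm, which is exactly the Minnesota problem: the filter measures word co-occurrence, not meaning.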
Big data and election polling
Big data is, without doubt, going to be important in polling of the future, but the Minnesota RBs conundrum hints at the complexity of working with new methods.
So aside from such semantic problems, what other technical challenges need to be overcome when analysing big, unstructured data from social media and other online sources?
- Weighting samples appropriately – not everyone has access to the Internet and social media is particularly slanted towards younger generations. Perhaps this problem will become less severe as time goes by and the digital generation work their way through the age cohorts. In many ways this is the same problem as in classical polling: how do you construct a fair sample group that will extrapolate to the whole electorate? Unlike times gone by, the answers will increasingly lie in code, software and algorithms which are ripe areas for R&D tax credits.
- Fickle trends – if pollsters have to be wary of who they are polling, they also need to factor into their analysis the fickleness of much of the content. Something that goes viral one day and appears to have tremendous significance could be old news or irrelevant come the time of a vote. Does anyone remember the #millifans?
- Processing power – the clue here is in the name: ‘big data’. It’s BIG…there is a lot of it to get through. And a huge amount of processing power is required to interrogate it. This is one area where quantum computers could revolutionise polling. For certain classes of problem, quantum computers could explore exponentially larger solution spaces than classical computers by exploiting a key quantum principle known as superposition. This is expected to be particularly relevant to problems that involve juggling many variables: weather forecasting, traffic flow management and polling. Problems that would take the very best computers of today decades to solve could, in theory, be calculated in seconds. Only unstable prototypes exist at present, however.
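The weighting problem in the first bullet can be made concrete with a minimal post-stratification sketch. The numbers below are invented for illustration: a sample that skews young is re-weighted so that each age band counts in proportion to its share of the electorate.

```python
# Minimal post-stratification sketch (illustrative numbers, not real polling
# data). Each age band's responses are re-weighted so the sample matches the
# electorate's assumed age profile.
population_share = {"18-34": 0.30, "35-54": 0.35, "55+": 0.35}

# Raw sample: share of respondents per band, and their support for party A.
sample_share    = {"18-34": 0.50, "35-54": 0.30, "55+": 0.20}  # skews young
party_a_support = {"18-34": 0.60, "35-54": 0.45, "55+": 0.35}

# Unweighted estimate: a straight average over respondents.
unweighted = sum(sample_share[g] * party_a_support[g] for g in sample_share)

# Weight for each group = population share / sample share.
weights  = {g: population_share[g] / sample_share[g] for g in sample_share}
weighted = sum(sample_share[g] * weights[g] * party_a_support[g]
               for g in sample_share)

print(f"unweighted: {unweighted:.1%}, weighted: {weighted:.1%}")
# -> unweighted: 50.5%, weighted: 46.0%
```

Because the over-sampled young cohort is also the most pro-party-A here, the raw figure overstates support by four and a half points – the same mechanism blamed for the 2015 miss.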
How did an actual big data forecast do in the 2015 General Election?
For this we turn to an academic study by MacDonald and Mao. Their paper outlines big data forecasting work they did on both the Scottish Referendum and the 2015 General Election. The paper explains their methodology which involved looking at Google keyword trends using a TRUST framework – Topic Retrieved, Uncovered and Structurally Tested.
Most interestingly, their forecast was more accurate than the traditional surveys, coming out with a mean prediction of 318 Conservative seats – still a little short of the total, but certainly not the hung parliament that was the consensus opinion.
With consideration for the future they outlined six areas for improvement on what they achieved. These are interesting areas of R&D for the polling and big data industries.
- Dynamic text analysis of newspaper articles and TV programme scripts using machine learning and natural language processing.
- A wider spread of data mining techniques could be applied to multi-media and mass media sources.
- Use of text mining from social media sources in addition to Google.
- Devising ways of overlaying localised information onto internet big data so that forecasting can be accurately drilled down to constituency, and therefore seat, level. This is highlighted in other reports as being a major obstacle.
- Writing the latest voting theory into the modelling.
- Ensuring the latest modelling techniques are applied.
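The constituency-level obstacle in the list above is usually bridged with uniform national swing – the classic piece of voting theory for turning a national vote-share forecast into seats. The constituencies and swing below are hypothetical; real models of the kind MacDonald and Mao propose would be considerably more sophisticated.

```python
# Uniform national swing sketch: apply one national swing figure to every
# seat's previous result and count who comes out ahead (hypothetical seats).
constituencies = [
    {"name": "Seat A", "con": 0.42, "lab": 0.40},
    {"name": "Seat B", "con": 0.35, "lab": 0.45},
    {"name": "Seat C", "con": 0.38, "lab": 0.39},
]

swing_to_con = 0.02  # forecast two-point swing from Lab to Con

con_seats = 0
for seat in constituencies:
    con = seat["con"] + swing_to_con
    lab = seat["lab"] - swing_to_con
    if con > lab:
        con_seats += 1

print(f"Con projected to win {con_seats} of {len(constituencies)} seats")
# -> Con projected to win 2 of 3 seats
```

Its weakness is the word ‘uniform’: local swings vary, which is exactly why the list calls for overlaying localised data.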
More technological challenges to the polling industry
We often think of technological advances making our lives easier. But the march of progress has caused considerable problems for pollsters. We have already touched on the allure of the Internet as a source of data, but for pollsters it comes with the age-old problem of finding representative samples. Trickier still has been the rise of mobile phones. The New York Times zeroed in on this, albeit in relation to polling in the USA. They cited the proportion of people using solely ‘cell’ phones (they are American, bless them) as rising from 6% to 43% in the ten years to 2014. This poses a problem in America because Federal Law prohibits ‘auto-dialling’ to cell phones (but not landlines).
To carry out a 1,000-person poll, 20,000 telephone numbers may have to be dialled. Stick to autodialling landlines and they miss about half of the electorate. Manually dial cell phones and the costs rise significantly. The problem of calling such vast numbers of people has been cited in England too, where 15% to 20% of polling calls made are to mobiles with the rest being to landlines. Here 20,000 to 30,000 calls may have to be made to get a 2,000-person survey.
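A quick back-of-the-envelope calculation on the figures quoted above shows what those dial counts imply about completion rates per call:

```python
# Implied per-call completion rates from the figures in the text:
# US: ~20,000 dials for a 1,000-person poll.
# UK: 20,000-30,000 calls for a 2,000-person survey.
us_rate      = 1_000 / 20_000   # 5.0%
uk_rate_low  = 2_000 / 30_000   # ~6.7%
uk_rate_high = 2_000 / 20_000   # 10.0%

print(f"US: {us_rate:.1%}, UK: {uk_rate_low:.1%} to {uk_rate_high:.1%}")
# -> US: 5.0%, UK: 6.7% to 10.0%
```

At one completed interview per ten or twenty dials, forcing even a fraction of those calls to be dialled by hand makes the cost problem obvious.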
Machine learning and election polls
Another piece of the technology jigsaw that is starting to inform election polling is machine learning. Huge companies like Google and IBM take great interest in this area and have initiatives like Deep Mind and Watson respectively, but we are going to look at an Oxford University spin-out called TheySay.
TheySay has developed a hybrid engine coupling advanced linguistic algorithms with machine learning that can analyse sentiment in big data sources like social media, emails and blogs. Their service can study, comprehend and evaluate opinions and emotions expressed in text and turn them into quantifiable data.
One of the key machine learning challenges they have overcome is to produce a system that can be transparently interrogated, so that each AI decision made can be monitored, reviewed and adjusted if necessary. They suggest that other statistical machine learning systems can reach a certain level of accuracy but cannot be verified or corrected as theirs can.
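TheySay’s hybrid engine is proprietary, but the transparency property described above can be illustrated at the simplest end of the spectrum with a lexicon-based scorer, where every decision traces back to an inspectable word weight. This toy (invented lexicon, invented sentences) is nothing like a production system – it exists only to show what ‘quantifiable sentiment’ means:

```python
# A minimal, fully transparent lexicon-based sentiment scorer. Every score
# can be audited and adjusted by editing the word weights - the property
# the text above contrasts with opaque statistical systems.
LEXICON = {"strong": 1, "won": 1, "great": 1,
           "weak": -1, "lost": -1, "terrible": -1}

def score(text: str) -> int:
    # Sum the weight of every known word; unknown words score zero.
    return sum(LEXICON.get(w.strip(".,!?").lower(), 0) for w in text.split())

print(score("Strong performance, he won the debate"))  # -> 2
print(score("Weak answers and a terrible closing"))    # -> -2
```

Real engines must also handle negation, sarcasm and context (‘not a terrible debate’), which is where the linguistic algorithms and machine learning come in.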
All this can help inform businesses on consumer sentiment, traders on stock market movement and you guessed it… people interested in the outcome of elections. It’s all clever stuff but as we have seen before, does not necessarily answer the $64,000 question.
After the televised election debates last year, TheySay tracked tweets on the performance of David Cameron and Ed Miliband. 210,864 tweets were analysed for Cameron and 284,896 for Miliband. The Labour leader was adjudged to have won with 52% positive tweets and 48% negative, whilst Cameron had 47% positive and 53% negative. All well and good, and there is no cause to call the data analysis into question, but this certainly did not correlate to the election result. This suggests that there is plenty of room for R&D that improves how data is framed and interpreted after it has been analysed, as well as focussing on the analysis itself.
There are many areas of R&D in machine learning other than linguistic analysis and algorithms, including developing APIs, logic rules, topic classification and taxonomic categorisation. As in-house and freelance staff costs can qualify for R&D tax credits, as well as consumables like lighting and heating, the value of the tax credit to companies in this sector is huge. For a more in-depth look at R&D in machine learning and AI go here.
Technology a double-edged sword for the polling industry
You could say that polling is one industry in which innovation has caused more problems than it has solved. Right now the pollsters appear to be in the opposite of a sweet spot. Will 2015 prove to have been the nadir? Eventually, as technologies converge, R&D in big data, quantum computing and machine learning will have a big impact on polling. But to return to the original question – “Will technology help the polling companies get the EU Referendum result right?” – it’s probably too early for technology to make a major difference. But some of the non-technological lessons learned from 2015 will surely see them come a bit closer!
Are you innovating in the polling industry?
Technological innovation in the polling industry is likely to involve cutting-edge R&D. If you haven’t checked out R&D tax credits and you are a UK company, you should contact us. ForrestBrown’s expertise (and 100% success rate) could help you recoup as much as one third of your development costs.