In the 1936 American presidential election, the magazine Literary Digest conducted what was then the largest poll ever. It mailed out 10 million questionnaires, got a 24 per cent response rate (2.4 million replies), and confidently predicted that Republican challenger Alfred Landon would defeat Democratic President Franklin Roosevelt by 57 to 43 per cent. It could not have been more wrong: Roosevelt won 62 to 38 per cent.

Professor Shen Haipeng of the Faculty of Business and Economics likes to cite this example because it highlights a problem that persists in big data even today. The pollsters got it wrong because their sample was heavily weighted towards wealthier, and therefore Republican-leaning, voters. In contrast, a much smaller but more representative poll of 50,000 people by Gallup correctly predicted the outcome.

“The message is that it’s not size that matters, but whether the data collected comes from the targeted population that you want to study. If it is, then 50,000 responses can give you a better answer than 2.4 million,” he said.
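The sampling lesson can be made concrete with a small simulation. This is a minimal sketch with invented numbers (a wealthier stratum supporting the incumbent at only 40 per cent, against 62 per cent overall): a huge poll drawn from a biased frame misses badly, while a far smaller representative poll lands close to the truth.

```python
import random

random.seed(42)

# Invented support rates for illustration, not historical microdata:
# overall support for the incumbent is 62%, but among the wealthier
# stratum (the only group the biased poll reaches) it is just 40%.

def draw(support_rate, n):
    """Simulate n yes/no poll responses at the given support rate."""
    return sum(random.random() < support_rate for _ in range(n)) / n

# Biased poll: 2.4 million responses, all from the wealthy stratum.
biased_estimate = draw(0.40, 2_400_000)

# Representative poll: only 50,000 responses, drawn from everyone.
representative_estimate = draw(0.62, 50_000)

print(f"biased (n=2.4M):        {biased_estimate:.1%}")
print(f"representative (n=50k): {representative_estimate:.1%}")
```

The biased estimate sits near 40 per cent no matter how large the sample grows; more data from the wrong frame only makes the wrong answer more precise.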

Professor Shen trained as a data scientist and has been handling big data for years in the vastly different fields of bank call centres, precision medicine and hospitals.

His early research involved working with the customer service operations of FleetBoston Bank in the US to track patterns as customers moved through the system and identify bottlenecks and imbalances. This research also revealed that employees were gaming the system to increase their tally of customers by hanging up a few seconds into calls, that VIP customers tended to be willing to wait longer when on hold, and that customers of all types tended to hang up after 60 seconds, when a recorded message asking them to keep waiting in fact reminded them that they were on hold. “This work showed how, with a large volume of data and simple analysis, you can reveal different human behaviours,” he said.
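The kind of "simple analysis" he describes can be sketched with a toy version of such call logs. The data below are invented, and the thresholds (calls under 10 seconds, the 60-second recording) are assumptions for illustration; the idea is just that counting over a large log surfaces both the gaming behaviour and the abandonment spike.

```python
from collections import Counter

# Hypothetical call records: (agent or None if never answered,
# duration in seconds, abandoned_by_caller). Invented data standing
# in for a bank call-centre log.
calls = (
    [("agent_7", d, False) for d in (3, 4, 2, 3, 5) * 40]   # very short "handled" calls
    + [("agent_1", d, False) for d in range(120, 420, 3)]   # normal handled calls
    + [(None, 60, True)] * 80                               # hang-ups right at the 60s message
    + [(None, d, True) for d in range(10, 55, 5)]           # other abandonments
)

# Analysis 1: agents with an unusual share of very short calls may be
# gaming the system to inflate their customer tally.
short_by_agent = Counter(a for a, d, ab in calls if a and d < 10)
total_by_agent = Counter(a for a, d, ab in calls if a)
for agent, total in total_by_agent.items():
    share = short_by_agent[agent] / total
    if share > 0.5:
        print(f"{agent}: {share:.0%} of calls under 10 seconds")

# Analysis 2: when do callers give up? A spike at 60 seconds would
# coincide with the "please keep waiting" recording.
abandon_times = Counter(d for a, d, ab in calls if ab)
peak_second, peak_count = abandon_times.most_common(1)[0]
print(f"most common abandonment time: {peak_second}s ({peak_count} calls)")
```

Nothing here is more sophisticated than counting, which is the point: at sufficient volume, plain tallies over the log expose distinct human behaviours.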

Professor Shen might have restricted his investigations to more traditional business subjects but in 2006, during a visit home to Beijing from his job in the US, his father fell ill. He realised he did not know any medical experts there who could advise him about his father’s care so when he returned to his job – at the University of North Carolina-Chapel Hill – he began to seek out potential collaborators in medicine.


A second expertise

He attended a conference of top neurologists, offered his expertise in data analytics, and went on to establish fruitful collaborations with Chinese and American scholars in precision medicine, in particular on strokes.

This work has combined targeted studies with trawls through existing data to see what could be gleaned from information on patient demographics, diagnoses, medical histories, laboratory results and so forth. “This is a second-generation attempt to extract value from data already collected and leverage the big data platform. The benefit is that you increase the dimensions you measure and the sample size,” he said.

In one project, Professor Shen and his collaborators analysed multiple databases to identify stroke patients who were at risk of recurrence because their nighttime or early morning blood pressure spiked. In another project, they showed that treating stroke patients with both aspirin and a blood thinner was more effective than aspirin alone. This finding was incorporated into medical guidelines for stroke patients in the US and elsewhere.

Professor Shen is also part of a RMB20 million national key project grant to devise a data-supported system for improving the care of stroke patients in China. The goal is to enable physicians to input patient characteristics and see which treatments have worked in patients with similar characteristics. This would be particularly helpful in lower-tier hospitals where doctors may not have the training or resources of top-tier hospitals.

He has also managed to connect his two fields of work, albeit indirectly, by applying the method he used for mapping the call centre to the movement of patients through a hospital in China. A bottleneck was identified between the emergency room and imaging department, which was subsequently found to be due to crowded corridors. The hospital opened an underground passage to the imaging department, which solved the problem.

“The overarching theme that ties my work in neurology, business analytics and machine learning together is data-driven decision-making and using that to improve the efficiency and quality of services,” he said. “Big data is the future because whoever has the best quality data will have the best AI [artificial intelligence] applications” – the accuracy of which would put the human editors of Literary Digest to shame. 

BIGGER IS NOT ALWAYS BETTER

With big data, the quality of the input matters as much as the quantity, says Professor Shen Haipeng, who has applied data analytics to fields as diverse as bank call centres and precision medicine.

Image caption: A big data view of multiple databases on stroke patients; the medical data were obtained from neurological studies in three groups.