Using novel data streams to fight COVID-19
With social distancing measures in place, a large amount of discourse relating to COVID-19 now takes place on social media platforms such as Twitter. These platforms contain a treasure trove of information that can help us answer questions such as ‘How many people are exhibiting coronavirus symptoms today?’ However, not all information is created equal: these platforms also contain a lot of misinformation, which could harm members of the public.
We developed a system to track and analyse tweets that mention symptoms of COVID-19. The system ‘listens’ for such tweets and, once one is identified, feeds it through a machine learning classifier that determines whether the tweet relates to the user’s own symptoms, relates to someone else’s symptoms, or contains misinformation.
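The ‘listen, then classify’ flow described here can be illustrated with a minimal Python sketch. The symptom list, the `Label` names and the `process` helper are assumptions made for illustration; the original system's keyword list and classifier are not described in this post, so the classifier is passed in as an arbitrary callable.

```python
import re
from enum import Enum

# Hypothetical symptom terms; the real system's list is not given in the post.
SYMPTOM_TERMS = ["cough", "fever", "loss of taste", "loss of smell",
                 "shortness of breath"]

# One case-insensitive pattern that matches any term as a whole word or phrase.
_SYMPTOM_RE = re.compile(
    r"\b(" + "|".join(re.escape(term) for term in SYMPTOM_TERMS) + r")\b",
    re.IGNORECASE,
)


class Label(Enum):
    """The three categories named in the post (identifier names are assumed)."""
    OWN_SYMPTOMS = "own_symptoms"
    OTHERS_SYMPTOMS = "others_symptoms"
    MISINFORMATION = "misinformation"


def mentions_symptom(text: str) -> bool:
    """Return True if the tweet text mentions any listed symptom term."""
    return _SYMPTOM_RE.search(text) is not None


def process(tweets, classify):
    """Keep tweets that mention a symptom and attach the classifier's label.

    `classify` is any callable mapping tweet text to a Label; it stands in
    for the trained machine learning classifier.
    """
    return [(text, classify(text)) for text in tweets if mentions_symptom(text)]
```

Keeping the keyword filter separate from the classifier means the cheap filter runs on every incoming tweet, while the model only sees the small fraction that mention a symptom.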
We can also use geolocation data to calculate the number of users who tweet about symptoms in each region of a given country (where geolocation is permitted by the user). From this data, it is possible to determine the number of users who travel between different regions of a given country. This information could potentially help to identify new outbreak clusters within a country and provide insight into how members of the public responded to lockdown measures.
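As a rough sketch of this kind of aggregation, the Python below counts distinct users per region and the number of users seen moving from one region to another. The record layout `(user_id, date, region)` and both function names are assumptions; the post does not describe how tweets are mapped to regions.

```python
from collections import Counter, defaultdict


def users_per_region(records):
    """Count distinct users who tweeted about symptoms in each region.

    records: iterable of (user_id, iso_date, region) tuples (assumed layout).
    """
    seen = defaultdict(set)
    for user_id, _date, region in records:
        seen[region].add(user_id)
    return {region: len(users) for region, users in seen.items()}


def region_moves(records):
    """Count users observed moving between regions.

    Returns a Counter keyed by (origin, destination); each user contributes
    at most once per pair, however many times they make the same move.
    """
    by_user = defaultdict(list)
    for user_id, date, region in records:
        by_user[user_id].append((date, region))

    moves = Counter()
    for visits in by_user.values():
        visits.sort()  # ISO dates sort chronologically as strings
        pairs = {
            (prev, curr)
            for (_, prev), (_, curr) in zip(visits, visits[1:])
            if prev != curr
        }
        moves.update(pairs)
    return moves
```

Counting each user once per origin and destination pair keeps a single very active account from dominating the movement counts.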
To make this information easily accessible, we developed a ‘Symptom Watch’ dashboard, which reports a daily count of the number of tweets that mention symptoms. These counts are currently provided per state in the USA and at various levels (local and upper tier authority, NHS region and national) in the UK. This functionality will be extended to other countries in the near future.
We have also been working with Evergreen Life to analyse data from their health and wellness app. In response to COVID-19, Evergreen Life have been asking app users questions to gain insight into the pandemic. Users are asked to report, for example, whether they are isolating and whether they or someone in their household has symptoms. The depth and breadth of the data collected is impressive and could help answer a wide range of questions.
The team has developed solutions to answer some of these questions, for example the average length of time for which an individual experiences symptoms of COVID-19. User reports to the Evergreen Life app are sporadic, so we do not see a complete timeline of reports covering the full period during which an individual exhibits symptoms. To deal with the sporadic nature of these reports, we defined and fitted a Bayesian model in the Stan programming language, which enabled us to determine that users were most likely to experience symptoms for 3.06 days.
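To give a feel for how a Bayesian model can cope with sporadic reports, here is a small Python sketch. It is not the team's Stan model, which is not described in this post. It assumes each user's true symptom duration is only known to lie between a lower and an upper bound, that durations follow an exponential distribution, and that the prior on the mean is flat; the posterior is approximated on a grid rather than sampled with Stan.

```python
import math


def posterior_mean_duration(intervals, grid=None):
    """Grid-approximate the posterior over the mean symptom duration in days.

    intervals: list of (lo, hi) bounds on each user's true duration, e.g.
        lo = span between first and last symptomatic report,
        hi = span between the surrounding symptom-free reports.
    Assumes durations ~ Exponential(mean=m) with a flat prior on m (both
    assumptions are illustrative, not taken from the original analysis).
    Returns (MAP estimate, posterior mean, posterior probabilities).
    """
    if grid is None:
        grid = [0.1 * k for k in range(1, 301)]  # candidate means up to 30 days

    log_post = []
    for m in grid:
        log_lik = 0.0
        for lo, hi in intervals:
            # P(lo < duration <= hi) for an exponential with mean m
            p = math.exp(-lo / m) - math.exp(-hi / m)
            log_lik += math.log(max(p, 1e-300))  # guard against log(0)
        log_post.append(log_lik)

    # Normalise in a numerically stable way
    peak = max(log_post)
    weights = [math.exp(lp - peak) for lp in log_post]
    total = sum(weights)
    probs = [w / total for w in weights]

    map_estimate = grid[probs.index(max(probs))]
    mean_estimate = sum(m * p for m, p in zip(grid, probs))
    return map_estimate, mean_estimate, probs
```

The key idea is that an incomplete timeline still carries information: each user contributes the probability that their duration falls inside the observed bounds, rather than a single exact value.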
Where users report a household member exhibiting symptoms, we can gain insight into how COVID-19 spreads within households by determining the time between two household members falling ill. We also know whether a user is isolating and whether they subsequently develop symptoms. From these reports, we can quantify whether isolating reduces the chance of developing coronavirus. We analysed data collected between March and June this year and determined that individuals who did not isolate were 35% more likely to report symptoms within 7 days of reporting that they were not isolating.
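One common way to express a comparison like ‘35% more likely’ is a relative risk: the proportion reporting symptoms in one group divided by the proportion in the other. The post does not say which measure was used, so the Python sketch below is only an illustration, with a hypothetical function name and parameters.

```python
def relative_risk(events_exposed, total_exposed, events_unexposed, total_unexposed):
    """Ratio of the symptom-report rates in two groups.

    'Exposed' here means users who reported not isolating; 'unexposed' means
    users who reported isolating. An 'event' is a symptom report within the
    follow-up window (7 days in the post).
    """
    if total_exposed <= 0 or total_unexposed <= 0:
        raise ValueError("group sizes must be positive")
    if events_unexposed <= 0:
        raise ValueError("relative risk is undefined with no events in the comparison group")
    risk_exposed = events_exposed / total_exposed
    risk_unexposed = events_unexposed / total_unexposed
    return risk_exposed / risk_unexposed


# Made-up placeholder counts, NOT the Evergreen Life data.
rr = relative_risk(events_exposed=30, total_exposed=200,
                   events_unexposed=20, total_unexposed=200)
print(f"relative risk: {rr:.2f}")
```

Under this measure, a ratio of 1.35 would correspond to one group being 35% more likely to report symptoms than the other.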
The work we have done so far demonstrates how novel data streams can be utilised to gain a deeper understanding of the COVID-19 pandemic. When combined with more conventional data streams, these novel data streams could aid governments in making more informed decisions to combat the virus.
Matthew Carter is a PhD student who is part of the EPSRC CDT in Distributed Algorithms. This blog was originally posted on the University of Liverpool COVID-19 Hub.