{"id":257,"date":"2020-02-03T04:23:03","date_gmt":"2020-02-03T04:23:03","guid":{"rendered":"http:\/\/martinsiron.com\/?p=257"},"modified":"2020-02-03T04:23:03","modified_gmt":"2020-02-03T04:23:03","slug":"using-named-entity-recognition-and-natural-language-processing-to-build-a-map-of-accumulated-infections-of-n-cov2019","status":"publish","type":"post","link":"http:\/\/martinsiron.com\/index.php\/2020\/02\/03\/using-named-entity-recognition-and-natural-language-processing-to-build-a-map-of-accumulated-infections-of-n-cov2019\/","title":{"rendered":"Using Named Entity Recognition and Natural Language Processing to build a map of accumulated infections of n-Cov2019"},"content":{"rendered":"\n<p>I began by scraping the following website for all its information and translating the text using the Google Translate API: <a href=\"https:\/\/ncov.dxy.cn\/ncovh5\/view\/pneumonia_timeline?whichFrom=dxy\">https:\/\/ncov.dxy.cn\/ncovh5\/view\/pneumonia_timeline?whichFrom=dxy<\/a><\/p>\n\n\n\n<p>Each entry consists of a header, a description, a timestamp, and the source the information came from. Using an HTML parsing library in Python, I extracted these sections of the page and placed them into a pandas DataFrame.<\/p>\n\n\n\n<p>After some cleaning, I used \u201cNamed Entity Recognition\u201d (NER) to extract locations from both the descriptions and the titles. Since location extraction is such a common use of NER, I decided to use an existing model: DeepPavlov. I loaded this model, extracted all the locations, and put them in their own columns in the DataFrame.<\/p>\n\n\n\n<p>Next, I needed to extract the latitude and longitude of these locations. Using GeoPy, I geocoded and averaged all the locations in both the header and description. 
For now, I opted to use only the location information in the header: at a quick glance, it tends to name the main region where the cases occurred, whereas the descriptions often mention the cities or entities that merely tested the cases.<\/p>\n\n\n\n<p>Now I was left with my main bottleneck: description text like this:<\/p>\n\n\n\n<blockquote class=\"wp-block-quote is-layout-flow wp-block-quote-is-layout-flow\"><p>&#8220;From 00:00 to 24:00 on January 31, 2020, Shanxi Province reported 8 new confirmed cases of pneumonia due to new coronavirus infection. As of 14:00 on January 31, 47 cases of pneumonia confirmed by new coronavirus infection have been reported in 11 cities in Shanxi Province (including 2 severe cases, 1 critical case, 1 discharged case, and no deaths). At present, 1333 close contacts have been tracked. 41 people were released from medical observation on the same day, and a total of 1101 people were receiving medical observation.&#8221;<\/p><\/blockquote>\n\n\n\n<p><strong>From this, I had already extracted the following:<\/strong><br><span style=\"text-decoration: underline;\">Time<\/span>: January 31, 2020<br><span style=\"text-decoration: underline;\">Location<\/span>: Shanxi Province<br><strong>However, I still needed to extract the following:<\/strong><br><span style=\"text-decoration: underline;\">New cases:<\/span> 8 new confirmed cases of pneumonia (8)<br><span style=\"text-decoration: underline;\">Accumulated cases:<\/span> 47 cases of pneumonia confirmed (47)<br><span style=\"text-decoration: underline;\">New deaths:<\/span> no deaths (0)<\/p>\n\n\n\n<p>To approach this, I needed to train a custom NER model. I used TagTog\u2019s tagging and AI tool for this:<\/p>\n\n\n\n<figure class=\"wp-block-image\"><img loading=\"lazy\" decoding=\"async\" width=\"974\" height=\"212\" src=\"http:\/\/24.144.91.142\/wp-content\/uploads\/2020\/02\/report-tagtog.png\" alt=\"\" class=\"wp-image-258\" 
srcset=\"http:\/\/martinsiron.com\/wp-content\/uploads\/2020\/02\/report-tagtog.png 974w, http:\/\/martinsiron.com\/wp-content\/uploads\/2020\/02\/report-tagtog-300x65.png 300w, http:\/\/martinsiron.com\/wp-content\/uploads\/2020\/02\/report-tagtog-768x167.png 768w\" sizes=\"auto, (max-width: 974px) 100vw, 974px\" \/><figcaption>TagTog: Annotation Tool<\/figcaption><\/figure>\n\n\n\n<p>TagTog makes training easy: you highlight the relevant pieces of information in a document. I created 8 entities and 4 relations:<\/p>\n\n\n\n<table class=\"wp-block-table is-style-stripes\"><tbody><tr><td><strong>   Entity   <\/strong><\/td><td><strong>   Description   <\/strong><\/td><td><strong>   Relation   <\/strong><\/td><\/tr><tr><td>new_case_I   <\/td><td>New case added in location   <\/td><td>New cases and their number   <\/td><\/tr><tr><td>new_case_N   <\/td><td>Number of new cases   <\/td><td><\/td><\/tr><tr><td>acc_case_I   <\/td><td>Accumulated case in region   <\/td><td>Accumulated cases and their number   <\/td><\/tr><tr><td>acc_case_N   <\/td><td>Number of accumulated cases   <\/td><td><\/td><\/tr><tr><td>new_death_I   <\/td><td>New death recorded in region   <\/td><td>New deaths and their number   <\/td><\/tr><tr><td>new_death_N   <\/td><td>Number of new deaths   <\/td><td><\/td><\/tr><tr><td>acc_death_I   <\/td><td>Accumulated death in region   <\/td><td>Accumulated deaths and their number   <\/td><\/tr><tr><td>acc_death_N   <\/td><td>Number of accumulated deaths   <\/td><td><\/td><\/tr><\/tbody><\/table>\n\n\n\n<p>I manually tagged 20% of the data to train the model, then ran the rest of the data through it to extract the remaining information. 
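Turning the model's raw predictions into per-report counts can be sketched like this. It assumes the model output arrives as (text, label, probability) tuples; the function name, the probability threshold, and the "keep the highest-probability span per label" rule are my illustration, not TagTog's API.

```python
import re

def best_entity_values(predictions, min_prob=0.5):
    """Keep the highest-probability span per entity label, then parse the
    numeric labels (those ending in '-N') into integers; spans without
    digits, such as 'no', count as zero."""
    best = {}
    for text, label, prob in predictions:
        if prob >= min_prob and (label not in best or prob > best[label][1]):
            best[label] = (text, prob)

    values = {}
    for label, (text, _) in best.items():
        if label.endswith("-N"):
            digits = re.search(r"\d+", text)
            values[label] = int(digits.group()) if digits else 0
    return values
```

On the example report, this rule keeps the 0.813 span ('47') for accumulated cases over the lower-confidence 0.714 span ('41').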
From the initial document example, the model extracted the following information:<\/p>\n\n\n\n<table class=\"wp-block-table is-style-stripes\"><tbody><tr><td><strong>   Text   <\/strong><\/td><td><strong>   Entity   <\/strong><\/td><td><strong>   Probability   <\/strong><\/td><\/tr><tr><td>&#8216;no&#8217;<\/td><td>&#8220;accumulated_death-N&#8221;   <\/td><td>0.874<\/td><\/tr><tr><td>&#8216;8&#8217;<\/td><td>&#8220;new_case-N&#8221;   <\/td><td>0.838<\/td><\/tr><tr><td>&#8216;no deaths&#8217;<\/td><td>&#8220;accumulated_death-I&#8221;   <\/td><td>0.698<\/td><\/tr><tr><td>&#8216;1&#8217;<\/td><td>&#8220;new_death-N&#8221;   <\/td><td>0.621<\/td><\/tr><tr><td>&#8216;1&#8217;<\/td><td>&#8220;new_death-N&#8221;   <\/td><td>0.550<\/td><\/tr><tr><td>&#8216;47&#8217;<\/td><td>&#8220;accumulated_case-N&#8221;   <\/td><td>0.813<\/td><\/tr><tr><td>&#8216;41&#8217;<\/td><td>&#8220;accumulated_case-N&#8221;   <\/td><td>0.714<\/td><\/tr><tr><td>&#8216;8 new confirmed cases&#8217;<\/td><td>&#8220;new_case-I&#8221;   <\/td><td>0.842<\/td><\/tr><tr><td>&#8216;47 cases of pneumonia confirmed&#8217;<\/td><td>&#8220;new_case-I&#8221;   <\/td><td>0.640<\/td><\/tr><\/tbody><\/table>\n\n\n\n<p>We can see that the model, trained on just 20% of the data, did fairly well. <strong>Numerically, we were able to pick up the correct new case number (8) and the correct accumulated case number (47).<\/strong><\/p>\n\n\n\n<p>With the rest of the data parsed, I used Plotly to build a final heatmap (data from the morning of 01\/28). 
We can see that the model clearly picked up the epicenter and the surrounding virus activity.<\/p>\n\n\n\n<div class=\"wp-block-image\"><figure class=\"aligncenter\"><img loading=\"lazy\" decoding=\"async\" width=\"700\" height=\"500\" src=\"http:\/\/24.144.91.142\/wp-content\/uploads\/2020\/02\/final_heatmap.png\" alt=\"\" class=\"wp-image-259\" srcset=\"http:\/\/martinsiron.com\/wp-content\/uploads\/2020\/02\/final_heatmap.png 700w, http:\/\/martinsiron.com\/wp-content\/uploads\/2020\/02\/final_heatmap-300x214.png 300w\" sizes=\"auto, (max-width: 700px) 100vw, 700px\" \/><figcaption>Heatmap of nCov2019 Accumulated Infections as of 01.28.2020<\/figcaption><\/figure><\/div>\n\n\n\n<p>I plotted the accumulated cases over time:<\/p>\n\n\n\n<div class=\"wp-block-image\"><figure class=\"aligncenter\"><img loading=\"lazy\" decoding=\"async\" width=\"432\" height=\"288\" src=\"http:\/\/24.144.91.142\/wp-content\/uploads\/2020\/02\/acc_cases_overtime.png\" alt=\"\" class=\"wp-image-260\" srcset=\"http:\/\/martinsiron.com\/wp-content\/uploads\/2020\/02\/acc_cases_overtime.png 432w, http:\/\/martinsiron.com\/wp-content\/uploads\/2020\/02\/acc_cases_overtime-300x200.png 300w\" sizes=\"auto, (max-width: 432px) 100vw, 432px\" \/><figcaption>Accumulated cases over time for nCov2019 Infections<\/figcaption><\/figure><\/div>\n\n\n\n<p>Finally, I made a GIF heatmap of the virus activity over time (note that the scale bar changes over time):<\/p>\n\n\n\n<figure class=\"wp-block-image\"><img loading=\"lazy\" decoding=\"async\" width=\"700\" height=\"500\" src=\"http:\/\/24.144.91.142\/wp-content\/uploads\/2020\/02\/nCov2019-Heatmap-Compressed.gif\" alt=\"\" class=\"wp-image-261\"\/><figcaption>Animated heatmap over time. The scale bar changes over time.<\/figcaption><\/figure>\n\n\n\n<p><strong>Future:<\/strong><br>In the near future, I plan on releasing a Flask app on my website with the data. <strong>Stay tuned! 
<\/strong>Additionally, I will update it with data more recent than 01.28.2020.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>I begin by scraping the following website for all its information and translating the text using Google Translate API: https:\/\/ncov.dxy.cn\/ncovh5\/view\/pneumonia_timeline?whichFrom=dxy The data consists of a header, a description, a timestamp of some sort, and the source from where the information came from. Using the HTML library in Python I extracted these various sections of the&hellip;<a href=\"http:\/\/martinsiron.com\/index.php\/2020\/02\/03\/using-named-entity-recognition-and-natural-language-processing-to-build-a-map-of-accumulated-infections-of-n-cov2019\/\" class=\"button\">Read more <span class=\"screen-reader-text\">Using Named Entity Recognition and Natural Language Processing to build a map of accumulated infections of n-Cov2019<\/span><\/a><\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[1],"tags":[],"class_list":["post-257","post","type-post","status-publish","format-standard","hentry","category-uncategorized"],"_links":{"self":[{"href":"http:\/\/martinsiron.com\/index.php\/wp-json\/wp\/v2\/posts\/257","targetHints":{"allow":["GET"]}}],"collection":[{"href":"http:\/\/martinsiron.com\/index.php\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"http:\/\/martinsiron.com\/index.php\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"http:\/\/martinsiron.com\/index.php\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"http:\/\/martinsiron.com\/index.php\/wp-json\/wp\/v2\/comments?post=257"}],"version-history":[{"count":0,"href":"http:\/\/martinsiron.com\/index.php\/wp-json\/wp\/v2\/posts\/257\/revisions"}],"wp:attachment":[{"href":"http:\/\/martinsiron.com\/index.php\/wp-json\/wp\/v2\/media?parent=257"}],"wp:term":[{"taxonomy":"category","
embeddable":true,"href":"http:\/\/martinsiron.com\/index.php\/wp-json\/wp\/v2\/categories?post=257"},{"taxonomy":"post_tag","embeddable":true,"href":"http:\/\/martinsiron.com\/index.php\/wp-json\/wp\/v2\/tags?post=257"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}