In response to the COVID-19 pandemic, the White House and a coalition of leading research groups prepared the COVID-19 Open Research Dataset (CORD-19). CORD-19 is a resource of over 52,000 scholarly articles, including over 41,000 with full text, about COVID-19, SARS-CoV-2, and related coronaviruses. The dataset is publicly available through Kaggle's COVID-19 Open Research Dataset Challenge (CORD-19). I used only the articles in JSON format for this model: 7,865 articles containing 64,000 unique sections and 1.1M sentences. The dataset also includes article metadata such as date published, authors, title, and abstract.
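As an illustration, extracting sentences from one CORD-19 JSON article might look like the sketch below. The body_text/text field names follow the published CORD-19 JSON schema, and the regex splitter is a simple stand-in for a real sentence tokenizer such as NLTK's sent_tokenize:

```python
import json
import re

def article_sentences(path):
    # Load one CORD-19 article and return its body sentences.
    with open(path) as f:
        article = json.load(f)
    sentences = []
    for section in article.get('body_text', []):
        # Naive split on sentence-ending punctuation followed by whitespace.
        parts = re.split(r'(?<=[.!?])\s+', section['text'])
        sentences.extend(s.strip() for s in parts if s.strip())
    return sentences
```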
This solution is primarily for the medical domain, so I used the pre-trained BioSentVec model for embedding. BioSentVec provides biomedical word and sentence embeddings trained on PubMed and clinical notes from the MIMIC-III Clinical Database. Both the PubMed and MIMIC-III texts were split and tokenized using NLTK, and all words were lowercased. More details and the pre-trained model can be accessed here.
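Because BioSentVec was trained on lowercased, NLTK-tokenized text, the question and corpus sentences should be normalized the same way before embedding. A minimal sketch of that normalization (the regex tokenizer is a dependency-free stand-in for NLTK's word_tokenize):

```python
import re

def preprocess_sentence(text):
    # Lowercase, then split into word and punctuation tokens so the input
    # matches the space-separated format BioSentVec was trained on.
    tokens = re.findall(r"[a-z0-9\-]+|[^\w\s]", text.lower())
    return ' '.join(tokens)
```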
This solution is a retrieval-based question-answering (QA) model built on sentence embeddings. The basic idea is to compare the question string against the sentence corpus and return the top-scoring sentences as the answer. I created a vector representation of each sentence with the pre-trained BioSentVec embedding model and used k-nearest neighbors (KNN) to find the answer sentences.
Loading the BioSentVec pretrained model
import sent2vec

model_path = 'BioSentVec_PubMed_MIMICIII-bigram_d700.bin'
model = sent2vec.Sent2vecModel()
try:
    model.load_model(model_path)
except Exception as e:
    print(e)
print('model successfully loaded')
Vectorize the sentence corpus with BioSentVec model
embs = model.embed_sentences(convid_sent_df['sentence'])
KNN and Ranking
The k-Nearest Neighbors algorithm (KNN) is a very simple technique. First, I loaded the entire set of vectorized sentences into the model as training data. To find an answer, I send the vectorized question string to the model as input, and the KNN model returns the most similar records from the training sentence corpus along with their scores. From these neighbors, a summarized answer is made. Similarity between records can be measured in many different ways; I used the default metric here (Euclidean distance).
from sklearn.neighbors import NearestNeighbors
nbrs = NearestNeighbors(n_neighbors=2, algorithm='ball_tree').fit(embs)
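To see what kneighbors returns before running it on the real corpus, here is a toy example with 2-D vectors standing in for the 700-dimensional sentence embeddings:

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

# Toy 2-D vectors standing in for sentence embeddings.
X = np.array([[0.0, 0.0], [1.0, 0.0], [5.0, 5.0]])
nbrs = NearestNeighbors(n_neighbors=2, algorithm='ball_tree').fit(X)

# Query with one vector; kneighbors returns the two closest training rows,
# sorted by ascending Euclidean (the default) distance.
distances, indices = nbrs.kneighbors(np.array([[0.9, 0.1]]))
print(indices)  # [[1 0]]
```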
Test the Model
Test the trained kneighbors model by passing in a question, such as "What is the physical science of the coronavirus?"
emb = model.embed_sentence('Physical science of the coronavirus')
distances, indices = nbrs.kneighbors(emb)
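The returned indices can then be mapped back to the sentence corpus to assemble the answer. A small sketch of that ranking step; here sentences stands in for convid_sent_df['sentence'], and the helper name is illustrative:

```python
import numpy as np

def top_answers(sentences, distances, indices):
    # kneighbors returns neighbors already sorted by ascending distance,
    # so pairing indices with sentences preserves the ranking.
    return [(sentences[i], float(d)) for i, d in zip(indices[0], distances[0])]
```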