The goal of this component is to take the sentences (including those converted from table rows) classified as containing lot references/items (positive sentences) and parse them into a structured representation of the lots/items, such as that shown in Figure 6. The process involves two tasks: 1) determining the boundaries of lots and the lot references, as the previous steps only classify whether a sentence/table row contains lot information; 2) extracting, from each positive sentence, the structured item information, including the name of the item, its form, and its measurement.
For 1), we use simple rules as follows:
● Match the start of each sentence against the pattern ‘lot [token]+ [num]’, where [token] is an optional word (e.g., ‘number’, ‘no.’) and [num] is a number-like token. If a positive sentence matches this pattern, it is considered to contain a lot reference, and the [num] value is extracted as the reference;
● Each sentence matching the above rule marks the boundary of a lot;
● Positive sentences including/following an identified lot reference are considered to be associated with that lot and are subject to the second part of processing.
As examples, in Figure 2, we expect rows 2 and 3 to be considered as part of a single lot with a reference indicated in column 1, and each row to describe a separate item. In Figure 6, we expect each bullet point in Section 2.1 to be classified as a relevant lot item, with the lot reference ‘Lot 1’ extracted from the first and the remaining text passed on for further parsing. In Figure 7, we expect the rows in the first and second boxes to be classified as relevant, where the first row identifies the reference and the second describes the item.
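The lot-boundary rules can be illustrated with a minimal Python sketch. This is not Vamstar's implementation: the regular expression `LOT_REF`, the list of optional filler words, and the function name `group_by_lot` are assumptions made for illustration, and the input is assumed to be the positive sentences in document order.

```python
import re

# Hypothetical pattern: 'lot', an optional filler word, then a number-like token.
# The filler words listed here are illustrative, not the authors' actual list.
LOT_REF = re.compile(
    r"^\s*lot\s+(?:(?:number|no\.?|nr\.?)\s+)?(\d+[a-z]?)\b",
    re.IGNORECASE,
)


def group_by_lot(positive_sentences):
    """Group positive sentences under the most recent lot reference.

    A sentence matching LOT_REF opens a new lot (a lot boundary). Later
    sentences are attached to that lot until the next reference appears.
    Sentences seen before any reference are skipped in this sketch.
    """
    lots = []
    current = None
    for sentence in positive_sentences:
        match = LOT_REF.match(sentence)
        if match:
            current = {"ref": match.group(1), "sentences": [sentence]}
            lots.append(current)
        elif current is not None:
            current["sentences"].append(sentence)
    return lots


if __name__ == "__main__":
    example = [
        "Lot No. 1: Atropine Sulfate Solution for Injection",
        "Pre-filled syringe 10 ml",
        "Lot 2 Surgical gloves",
    ]
    for lot in group_by_lot(example):
        print(lot["ref"], lot["sentences"])
```

In this sketch, the sentence carrying the reference is kept in the lot's sentence list, since the text after the reference may itself describe an item and is subject to the second part of processing.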
For 2), in theory, one can build an NER tool given sufficient training data. However, creating training data at token level is very time-consuming and was deemed infeasible by Vamstar. Instead, Vamstar built an in-house rule-based tagger that uses a combination of regular expressions and dictionary lookup. Details of this part of the system are proprietary and cannot be included in this article. However, we explain the main principles below.
First, using all values extracted from the ‘lot and item descriptions’ field of the TED database mentioned before, n-grams are extracted and their frequencies are calculated. Domain experts then manually inspected the resulting ranked list of n-grams to determine a ‘threshold’ (in practice, different thresholds were derived for different values of ‘n’) above which n-grams were deemed reliable and retained as a lookup dictionary, referred to as the ‘lot and item n-gram dictionary’. Second, this dictionary, together with the ‘form’ and the ‘measurement unit’ dictionaries mentioned before, is used to match text within a sentence. Finally, once the matching is completed, post-processing rules are applied to resolve overlapping boundaries and to extract other values such as quantities. For example, given ‘Atropine Sulfate Solution for Injection 3 mg/10 ml Pre-filled Syringe’, the ‘matching’ phase may identify ‘atropine sulfate’ and ‘sulfate solution’ as candidate n-grams, ‘mg’ and ‘ml’ as measurement units, and ‘syringe’ as form. The post-processing will merge ‘atropine sulfate’ and ‘sulfate solution’, and process a context window around the identified measurement units to extract quantities.
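Since the actual tagger is proprietary, the following Python sketch only illustrates the general principles under stated assumptions. The toy dictionaries, the tokenizer, the fixed count threshold `min_count` (standing in for the thresholds chosen by domain experts), the one-token context window to the left of each unit, and all function names are hypothetical.

```python
import re
from collections import Counter

# Toy dictionaries for illustration; the real ones are derived from TED data.
ITEM_NGRAMS = {"atropine sulfate", "sulfate solution"}
FORMS = {"syringe", "tablet", "vial"}
UNITS = {"mg", "ml", "g"}

TOKEN = re.compile(r"\d+(?:\.\d+)?|[A-Za-z]+(?:-[A-Za-z]+)*")
NUMBER = re.compile(r"\d+(?:\.\d+)?")


def tokenize(text):
    """Split text into numbers and (possibly hyphenated) words."""
    return TOKEN.findall(text)


def build_ngram_dictionary(descriptions, n, min_count):
    """Keep n-grams that occur at least min_count times (assumed threshold)."""
    counts = Counter()
    for text in descriptions:
        tokens = tokenize(text.lower())
        counts.update(" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    return {gram for gram, count in counts.items() if count >= min_count}


def find_spans(tokens, dictionary, max_n=3):
    """Return sorted (start, end) token spans whose n-gram is in the dictionary."""
    spans = []
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            if " ".join(tokens[i:i + n]).lower() in dictionary:
                spans.append((i, i + n))
    return sorted(spans)


def merge_overlapping(spans):
    """Merge sorted spans that share at least one token."""
    merged = []
    for start, end in spans:
        if merged and start < merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def parse_item(sentence):
    """Match dictionaries, merge overlapping names, and extract quantities."""
    tokens = tokenize(sentence)
    name_spans = merge_overlapping(find_spans(tokens, ITEM_NGRAMS))
    names = [" ".join(tokens[s:e]) for s, e in name_spans]
    forms = [t.lower() for t in tokens if t.lower() in FORMS]
    measurements = []
    for i, tok in enumerate(tokens):
        # Context window: only the token immediately before the unit.
        if tok.lower() in UNITS and i > 0 and NUMBER.fullmatch(tokens[i - 1]):
            measurements.append((float(tokens[i - 1]), tok.lower()))
    return {"names": names, "forms": forms, "measurements": measurements}


if __name__ == "__main__":
    print(parse_item(
        "Atropine Sulfate Solution for Injection 3 mg/10 ml Pre-filled Syringe"
    ))
```

On the example sentence, this sketch merges the two overlapping bigrams into the single name ‘Atropine Sulfate Solution’, tags ‘syringe’ as the form, and pairs 3 with ‘mg’ and 10 with ‘ml’. A production tagger would likely need a wider context window and richer rules than shown here.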
Authors:
(1) Ziqi Zhang*, Information School, the University of Sheffield, Regent Court, Sheffield, UK S1 4DP (Ziqi.Zhang@sheffield.ac.uk);
(2) Tomas Jasaitis, Vamstar Ltd., London (Tomas.Jasaitis@vamstar.io);
(3) Richard Freeman, Vamstar Ltd., London (Richard.Freeman@vamstar.io);
(4) Rowida Alfrjani, Information School, the University of Sheffield, Regent Court, Sheffield, UK S1 4DP (Rowida.Alfrjani@sheffield.ac.uk);
(5) Adam Funk, Information School, the University of Sheffield, Regent Court, Sheffield, UK S1 4DP (Adam.Funk@sheffield.ac.uk).
This paper is