Domain and Task
Related Work
3.1. Text mining and NLP research overview
3.2. Text mining and NLP in industry use
4.6. XML parsing, data joining, and risk indices development
Experiment and Demonstration
Discussion
6.1. The ‘industry’ focus of the project
6.2. Data heterogeneity, multilingual and multi-task nature
Because our work is multi-task and draws on heterogeneous datasets, many areas of research are relevant. Our literature review is scoped as follows. First, we give an overview of research on the text mining and NLP tasks relevant to our work. We do not present individual studies in detail because the scope is too wide and the literature too large. Our intention instead is to draw on literature that informed the development of our solution and to reflect on why mainstream research findings cannot be easily adopted in industrial projects. Second, we examine areas where text mining and NLP are used to develop real-world applications, such as the legal and construction domains mentioned above. Again, we focus on the ‘big picture’, aiming to articulate the differences from the procurement domain. Finally, we give a detailed discussion of work on procurement text mining, the core area to which our work belongs. Here we cover each study in detail and compare it against this work.
Text mining and NLP are two different but overlapping research fields, each encompassing a wide range of tasks. Text mining focuses on discovering or extracting information from unstructured texts, whereas NLP focuses on enabling machines to understand human natural language. For example, named entity recognition (NER) and relation extraction (Han et al., 2020; Nasar et al., 2022) serve both purposes and can therefore be viewed as both text mining and NLP methods. Speech recognition, however, is a subfield of NLP that is not related to text mining. Both text mining and NLP are long-established research fields that have seen decades of development and the creation of sub-communities looking at specific tasks. Due to the space limit of this article, we only discuss the tasks that are relevant to this work: text classification, NER, text passage retrieval, and table information extraction.
Text classification (Kowsari et al., 2019) is one of the earliest text mining and NLP problems, and it has been widely studied and applied in many real-world applications. In simple terms, the goal is to assign predefined categories to text passages. In their survey on text classification, Kowsari et al. (2019) identified four levels of text passages: full document, paragraph, sentence, and sub-sentence, where segments of a sentence are examined. While the definitions of labels and text passages depend on the domain, it is widely acknowledged that the two key sub-processes are feature extraction and classifier training. Feature extraction converts texts into numerical vector representations for machine learning. Over the years it has evolved from extracting lexical/syntactic patterns (e.g., bag of words, n-grams, words, part of speech) to word embeddings that are learned by modelling word contexts in very large corpora (Birunda and Devi, 2021). When feature extraction results in high dimensional, sparse vectors, dimension reduction techniques (e.g., Principal Component Analysis, Latent Dirichlet Allocation) are used to transform a high dimensional vector into a low dimensional one optimised for machine learning. Following this, the classifier training stage attempts to learn, from the feature representations, patterns that differentiate texts with different labels. This requires a dataset with ground truth labels (i.e., training data).
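The two sub-processes can be illustrated with a minimal sketch. It assumes a bag-of-words representation and a multinomial Naive Bayes classifier with add-one smoothing, chosen here only for brevity and not taken from any of the surveyed studies. The function names and the tiny training set are hypothetical.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    """Feature extraction: lowercase bag-of-words tokens (deliberately naive)."""
    return text.lower().split()


def train_naive_bayes(labelled_texts):
    """Classifier training: count word frequencies per label."""
    word_counts = defaultdict(Counter)
    label_counts = Counter()
    for text, label in labelled_texts:
        label_counts[label] += 1
        word_counts[label].update(tokenize(text))
    vocabulary = {w for counts in word_counts.values() for w in counts}
    return word_counts, label_counts, vocabulary


def classify(text, model):
    """Return the label with the highest log-probability for the text."""
    word_counts, label_counts, vocabulary = model
    total_docs = sum(label_counts.values())
    best_label, best_score = None, -math.inf
    for label in label_counts:
        score = math.log(label_counts[label] / total_docs)
        total_words = sum(word_counts[label].values())
        for word in tokenize(text):
            if word in vocabulary:  # words unseen in training are ignored
                # Add-one (Laplace) smoothing avoids zero probabilities.
                score += math.log(
                    (word_counts[label][word] + 1) / (total_words + len(vocabulary))
                )
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Hypothetical ground-truth training data (text, label).
TRAINING_DATA = [
    ("supply of surgical gloves and masks", "medical"),
    ("delivery of syringes and surgical instruments", "medical"),
    ("road resurfacing and bridge repair works", "construction"),
    ("construction of a school building", "construction"),
]

model = train_naive_bayes(TRAINING_DATA)
predicted = classify("tender for surgical masks", model)
print(predicted)
```

The sketch only works because labelled examples are supplied up front, which is the dependency on training data noted above.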
NER (Nasar et al., 2022) deals with the extraction and classification of mentions of named entities (NEs) from texts. In ‘He was named on the bench by Mexico in midweek but Wolves received an apology for that as Raul Jimenez was never going to play’, the mentions ‘Mexico’, ‘Wolves’ and ‘Raul Jimenez’ are NEs that are classified as country, organisation (football club) and person (footballer) respectively. NEs represent important information in a text and are often anchors for locating more complex information such as relationships and events. By definition, NER comprises two sub-processes that are often dealt with by supervised classification. The first is to identify the ‘boundaries’ of an NE, which can be a sequence of tokens. The second is to classify that sequence into predefined categories. Arguably, therefore, NER can be seen as text classification at a lower granularity, the token level, and hence follows the same ‘feature representation’ and ‘classifier training’ steps. Features are often described at token level, and can include, for example, lexical/syntactic patterns of the token and of its preceding and following tokens, or word embeddings.
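A minimal sketch of the token-level view follows. It assumes the common BIO tagging scheme (B- marks the first token of an entity, I- a continuation, O a non-entity token), which is an illustrative choice and not something prescribed by the studies cited here. The feature names are hypothetical, and the tags are written by hand, standing in for the output of a trained classifier.

```python
def token_features(tokens, i):
    """Describe token i by its own surface patterns and its neighbours."""
    token = tokens[i]
    return {
        "word": token.lower(),
        "is_capitalised": token[0].isupper(),
        "suffix3": token[-3:].lower(),
        "prev_word": tokens[i - 1].lower() if i > 0 else "<START>",
        "next_word": tokens[i + 1].lower() if i < len(tokens) - 1 else "<END>",
    }


def decode_bio(tokens, tags):
    """Recover entity boundaries and categories from per-token BIO tags."""
    entities, current, current_type = [], [], None
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            if current:  # close the entity that was open
                entities.append((" ".join(current), current_type))
            current, current_type = [token], tag[2:]
        elif tag.startswith("I-") and current and tag[2:] == current_type:
            current.append(token)  # extend the open entity
        else:
            if current:
                entities.append((" ".join(current), current_type))
            current, current_type = [], None
    if current:
        entities.append((" ".join(current), current_type))
    return entities


sentence = (
    "He was named on the bench by Mexico in midweek but Wolves received "
    "an apology for that as Raul Jimenez was never going to play"
)
tokens = sentence.split()

# Hand-written tags standing in for a classifier's predictions.
tags = ["O"] * len(tokens)
tags[tokens.index("Mexico")] = "B-COUNTRY"
tags[tokens.index("Wolves")] = "B-ORG"
tags[tokens.index("Raul")] = "B-PER"
tags[tokens.index("Jimenez")] = "I-PER"

features = token_features(tokens, tokens.index("Wolves"))
entities = decode_bio(tokens, tags)
print(features)
print(entities)
```

Here `token_features` corresponds to the feature representation step, while `decode_bio` shows how boundary detection and category assignment can both be read off one label per token.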
Both text classification and NER are relevant to our work: the first may help filter out irrelevant documents or content areas through binary text classification, while the second allows us to target the specific units of information (e.g., contract items, volume) that need to be extracted. However, we face the unique challenges of having no training data and a highly inconsistent content structure. Research in both fields is long-established, and predominantly reports studies conducted on well-curated datasets with consistent formats and structures. For example, multiple initiatives (Sang and Meulder, 2003; Piskorski et al., 2021) have been set up to foster NER research. However, their training data consist of well-curated, free-form sentences, often with the raw input tokens and their features already extracted and aligned. These hardly represent real-world problems. As discussed before, the data we are dealing with are highly inconsistent in their content formatting and structure, and we have no training data, only general-purpose domain lexicons. It is therefore impractical to directly apply state-of-the-art methods in our scenario.
Passage Retrieval (PR) is often used as a subprocess in Information Retrieval (IR) and Question Answering (QA) (Zhu et al., 2021). The goal is to find matching textual fragments (or passages) for a given query, at a lower granularity than a whole document. PR deals with splitting a long document into passages, and with scoring and ranking them against a query or question. Othman and Faize (2016) summarise common methods for each task. Briefly, passage splitting may be based on textual discourse units such as subsections and paragraphs, on arbitrary windows that select a fixed or variable number of words/sentences, or on ‘semantic similarity’, where units such as NEs are first extracted from documents and matched against the query, and then used as anchors to locate text passages. Ranking passages may be based on statistics such as the classic TF-IDF model used for document retrieval (Llopis et al., 2015), or on machine learning that takes into account different features extracted from each passage.
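The splitting and ranking steps can be sketched as follows. The sketch assumes fixed windows of two sentences and cosine similarity over TF-IDF vectors with a smoothed IDF. These are illustrative choices and not the specific methods of the cited studies, and the function names and sample document are hypothetical.

```python
import math
import re
from collections import Counter


def split_into_passages(text, window=2):
    """Split text into passages of `window` consecutive sentences."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    return [
        " ".join(sentences[i:i + window])
        for i in range(0, len(sentences), window)
    ]


def tokenize(text):
    """Lowercase word tokens with punctuation stripped."""
    return re.findall(r"\w+", text.lower())


def tfidf_vector(tokens, idf):
    """Weight each term count by its inverse document frequency."""
    return {t: c * idf.get(t, 0.0) for t, c in Counter(tokens).items()}


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_passages(query, passages):
    """Return (score, passage) pairs sorted from best to worst match."""
    tokenised = [tokenize(p) for p in passages]
    n = len(passages)
    doc_freq = Counter(t for tokens in tokenised for t in set(tokens))
    # Smoothed IDF; terms absent from every passage get weight 0.
    idf = {t: math.log((1 + n) / (1 + d)) + 1.0 for t, d in doc_freq.items()}
    query_vec = tfidf_vector(tokenize(query), idf)
    scored = [
        (cosine(query_vec, tfidf_vector(tokens, idf)), passage)
        for tokens, passage in zip(tokenised, passages)
    ]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)


DOCUMENT = (
    "The contracting authority is the city council. "
    "Bids must be submitted by post. "
    "The lot covers 5000 boxes of surgical gloves. "
    "Delivery is expected within 30 days. "
    "Payment terms are 60 days after invoice. "
    "Disputes are settled by local courts."
)

passages = split_into_passages(DOCUMENT, window=2)
ranked = rank_passages("surgical gloves delivery", passages)
for score, passage in ranked:
    print(round(score, 3), passage)
```

Taking `ranked[0]` gives the single best match; keeping every passage above a score threshold would instead return several candidates.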
PR is related to our work as part of our task is to identify text passages that contain relevant information for further analysis. However, our work is different in two ways. First, we do not have specific queries or questions to match against, but a rather vague notion of ‘relevance’. Second, we need to identify multiple passages, while in typical PR for IR or QA, a single best match is needed.
Table Information Extraction (TIE) aims at extracting data points from tabular content and recovering the semantic relationships between them. This includes a very wide range of tasks that can be broadly divided into those that detect and parse table structures to understand the geometric relationships between table elements, and those that conduct semantic analysis of table content to interpret the relationships embedded in the text. A recent survey (Zhang and Balog, 2020) covers some of these tasks; here we only briefly explain the tasks relevant to our work. Table Structure Parsing (TSP) (Paliwal et al., 2019) usually starts with images (including PDFs where content is not machine accessible) that contain tabular data, and aims to detect tables, locate their cells, and analyse the row-column relationships (including complex column/row spans). Table Interpretation (TI) deals with identifying entities from table text, and the semantic relations between them that are encoded by the table structure.
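To make the interpretation step concrete, the sketch below takes a table whose structure has already been parsed into rows of cells and maps its columns to semantic fields by matching header text against a small lexicon. This is only a toy illustration of the task, not the method of any cited work. The lexicon, field names and sample table are all hypothetical.

```python
# Hypothetical lexicon mapping semantic fields to known header variants.
HEADER_LEXICON = {
    "item": {"item", "product", "description"},
    "quantity": {"quantity", "qty", "volume"},
    "unit_price": {"unit price", "price"},
}


def interpret_header(header):
    """Map a raw header cell to a semantic field, or None if unknown."""
    normalised = header.strip().lower()
    for field, variants in HEADER_LEXICON.items():
        if normalised in variants:
            return field
    return None


def interpret_table(rows):
    """Turn a parsed table (first row holds headers) into records."""
    header_row, *data_rows = rows
    fields = [interpret_header(h) for h in header_row]
    records = []
    for row in data_rows:
        # Columns whose header is not recognised are treated as irrelevant.
        record = {
            field: cell.strip()
            for field, cell in zip(fields, row)
            if field is not None
        }
        records.append(record)
    return records


# Output that a structure parser might produce for a simple table.
TABLE = [
    ["Lot No.", "Description", "Qty", "Unit price"],
    ["1", "Surgical gloves", "5000", "0.12"],
    ["2", "Face masks", "12000", "0.05"],
]

records = interpret_table(TABLE)
print(records)
```

Each record pairs cells from the same row, which is the row-level relation encoded by the table structure; the unrecognised ‘Lot No.’ column is simply dropped.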
Both tasks are relevant to our work. As mentioned above, a significant portion of the relevant content in our procurement data is encoded in tables, which are predominantly found in PDF documents. However, the structure of tables varies between countries and contracts. Therefore, TSP is essential for detecting such content, parsing the complex tabular geometric relationships, and converting the data into a structured, machine accessible format. Moreover, tables can include irrelevant data, or they can encode certain types of content and relationships in different ways (e.g., different pairings of columns). Hence, TI is needed to interpret the semantic meanings of table columns and rows, so that only the relevant information is extracted into structured formats. As we shall explain later, while we are able to use state-of-the-art TSP tools to detect and extract tables, we have to opt for a different approach to TI. This is because TI typically depends on an external knowledge base (KB) that provides essential information for interpreting the content of a table and its semantics. In our case, such a KB does not exist.
Authors:
(1) Ziqi Zhang*, Information School, the University of Sheffield, Regent Court, Sheffield, UK, S1 4DP (Ziqi.Zhang@sheffield.ac.uk);
(2) Tomas Jasaitis, Vamstar Ltd., London (Tomas.Jasaitis@vamstar.io);
(3) Richard Freeman, Vamstar Ltd., London (Richard.Freeman@vamstar.io);
(4) Rowida Alfrjani, Information School, the University of Sheffield, Regent Court, Sheffield, UK, S1 4DP (Rowida.Alfrjani@sheffield.ac.uk);
(5) Adam Funk, Information School, the University of Sheffield, Regent Court, Sheffield, UK, S1 4DP (Adam.Funk@sheffield.ac.uk).
This paper is