paint-brush
How to Build Supplier Risk Profilesby@textmining

How to Build Supplier Risk Profiles

by Text MiningDecember 22nd, 2024
Read on Terminal Reader
Read this story w/o Javascript
tldt arrow

Too Long; Didn't Read

This article focuses on extracting and structuring supplier, contract, and item data from XMLs to build comprehensive supplier risk profiles and databases.
featured image - How to Build Supplier Risk Profiles
Text Mining HackerNoon profile picture
  1. Abstract and Introduction

  2. Domain and Task

    2.1. Data sources and complexity

    2.2. Task definition

  3. Related Work

    3.1. Text mining and NLP research overview

    3.2. Text mining and NLP in industry use

    3.3. Text mining and NLP for procurement

    3.4. Conclusion from literature review

  4. Proposed Methodology

    4.1. Domain knowledge

    4.2. Content extraction

    4.3. Lot zoning

    4.4. Lot item detection

    4.5. Lot parsing

    4.6. XML parsing, data joining, and risk indices development

  5. Experiment and Demonstration

    5.1. Component evaluation

    5.2. System demonstration

  6. Discussion

    6.1. The ‘industry’ focus of the project

    6.2. Data heterogeneity, multilingual and multi-task nature

    6.3. The dilemma of algorithmic choices

    6.4. The cost of training data

  7. Conclusion, Acknowledgements, and References

2.2. Task definition

2.2. Task definition Relating to the project’s ultimate goal of enabling the dynamic creation of ‘supplier risk profiles’, we envisage a database scheme that is partially (for reasons of both proprietary information and simplicity) depicted in Figure 4 showing entities (rounded boxes), their attributes (dashed boxes) and relationships (lines). Briefly, we hope to create a record for each contract award that has one buyer and one supplier, with associated lots (one or multiple) and item information. We store certain attributes of the buyer, supplier, contract award, lots and items, such that we can run queries to fetch ‘supplier centric’ award information like the range of products (based on lot items) they have supplied, the monetary value and quantities for different kinds of products, their buyers, and covered geographical regions (buyer country). Previously we mentioned that one award XML may mention multiple contracts with multiple suppliers. In practice, we ‘flatten’ data extracted from the award XML into multiple contract award records.


Figure 4. Partial database schema showing information of suppliers, buyers, contract awards, lot and item information to be extracted and stored. Round boxes indicate entities, dashed boxes list attributes in bullet points (only showing those important for understanding the rest of this article). The ‘1s’ indicate there is one supplier and buyer in one award record; the ‘*s’ indicate there can be multiple lots in an award record and multiple items in a lot.


As already mentioned, much of the information captured by this schema is already in (semi-)structured format available from the tender and/or award XMLs. These can be easily extracted through XML parsing, although some light-weight data cleansing and linking may be needed (e.g., extracting the country from a buyer/supplier address, or matching buyer names from the tender and award XMLs). These tasks are relatively straightforward and will not be the focus of this article. Instead, the real challenge is extracting and populating the lot and item information (indicated in red dotted box) that is often missing in both the tender and award XMLs. Therefore, our tasks can be defined as it is shown in Figure 5 - given a tender XML, its associated award XML, and the associated tender attachments, we aim to:


1. Extract the structured lot and item information often missing in tender and award XMLs (filling the schema highlighted in the red dotted box in Figure 4; also the middle lane in Figure 5):


a. From the tender attachments, identify the relevant documents, and the relevant content areas (i.e., tables, lists, paragraphs) that contain information about the lots (lot zoning);


b. From the identified content areas, detect content elements (e.g., sentences) associated with lot items (lot item detection);


c. From the detected lot items, parse the lot and item descriptions into a structured representation, identifying attributes such as the lot reference, name of the items, measurement units, etc (lot parsing). As an example, Figure 6 shows how we want to extract structured lot and item information from the example in Figure 3.


2. Extract the structured supplier, buyer, and contract award data from the tender and award XMLs (filling other parts of the schema shown in Figure 4; also the upper and bottom lanes in Figure 5).


3. Join the information extracted from 1 and 2 above to form supplier-centric contract records and populate the database.


4. Develop supplier risk indices using the populated database and a platform for exploring these indices.


For the scope of this article, we will focus on 1), briefly cover 2) and 3) as they are more straightforward and completed outside the scope of this project, and skip 4).


Figure 5. The overall workflow of the award record database construction process.


Figure 6. An example showing how structured lot information should be extracted.


Authors:

(1) Ziqi Zhang*, Information School, the University of Sheffield, Regent Court, Sheffield, UKS1 4DP (Ziqi.Zhang@sheffield.ac.uk);

(2) Tomas Jasaitis, Vamstar Ltd., London (Tomas.Jasaitis@vamstar.io);

(3) Richard Freeman, Vamstar Ltd., London (Richard.Freeman@vamstar.io);

(4) Rowida Alfrjani, Information School, the University of Sheffield, Regent Court, Sheffield, UKS1 4DP (Rowida.Alfrjani@sheffield.ac.uk);

(5) Adam Funk, Information School, the University of Sheffield, Regent Court, Sheffield, UKS1 4DP (Adam.Funk@sheffield.ac.uk).


This paper is available on arxiv under CC BY 4.0 license.