Domain and Task
Related Work
3.1. Text mining and NLP research overview
3.2. Text mining and NLP in industry use
4.6. XML parsing, data joining, and risk indices development
Experiment and Demonstration
Discussion
6.1. The ‘industry’ focus of the project
6.2. Data heterogeneity, multilingual and multi-task nature
In this section, we discuss lessons learned from the project that may inform future research and practice. We cover these from several angles: the different focus of industry projects compared to research projects, the complexity of building a full NLP pipeline for heterogeneous data and its implications for development, the dilemma of choosing between more advanced NLP methods and earlier classic methods, and the issue of training data in an industry context.
Very few studies explicitly discuss how the focuses of industry and research projects differ and how these differences could affect the approach taken. Among those that do (Chiticariu et al., 2013; Suganthan et al., 2015; Krishna et al., 2016), it is widely acknowledged that the lack of training data and the associated cost of creating it, the fast development cycle, the need for interpretability of machine learning predictions, and the continuous update of the model due to evolving business needs and knowledge are factors that typically render a research-focused approach impractical. Industry projects focus on ‘problem solving’ in real-world contexts with (often harsh) time and resource constraints, while research projects focus on ‘creative solutions’ to problems often studied in an ideal ‘lab’ environment that may not fully represent reality.
Our experience has supported most, if not all, of the viewpoints above. The industry project has a clear focus: developing a ‘proof-of-concept’ product within a time limit of one year and a financial budget. The business partner needs to be able to take the finished product forward for further development and/or extension beyond the project. Analysis at the beginning of the project showed its ‘multi-task’ and ‘heterogeneous data’ nature. Given all these factors, a typical research-driven approach is infeasible: one that aims to develop ‘novel’ solutions built on insights from rigorous critical evaluation of multiple baselines and the state of the art, using research ‘benchmarks’ that may deviate from the real data. Instead, one lesson we learned is that one needs to opt for ‘tried and tested’ methods that have low ‘barriers’ to users, who may need to take the solution forward for future development. This is a fundamental principle that needs to be taken into account when evaluating other aspects of the project.
Authors:
(1) Ziqi Zhang*, Information School, the University of Sheffield, Regent Court, Sheffield, UK, S1 4DP (Ziqi.Zhang@sheffield.ac.uk);
(2) Tomas Jasaitis, Vamstar Ltd., London (Tomas.Jasaitis@vamstar.io);
(3) Richard Freeman, Vamstar Ltd., London (Richard.Freeman@vamstar.io);
(4) Rowida Alfrjani, Information School, the University of Sheffield, Regent Court, Sheffield, UK, S1 4DP (Rowida.Alfrjani@sheffield.ac.uk);
(5) Adam Funk, Information School, the University of Sheffield, Regent Court, Sheffield, UK, S1 4DP (Adam.Funk@sheffield.ac.uk).
This paper is