1.            Introduction

Information is a fundamental asset in every organisation [1]. In the construction industry, the design and construction processes produce a wide spectrum of information. Even a small project can generate a large amount of digital information, such as specifications, computer-aided drawings, and structural analysis reports [2–4]. Most of this information is embedded in the text of documents delivered during the process [5], so a large percentage of construction data is stored in semi-structured and unstructured files [6]. Unstructured text remains the largest readily available source of knowledge [7], and text mining is therefore gaining interest in both research and industry as a valuable multidisciplinary area. According to Moody and Walsh [1], and in contrast with other assets, the value of information is not reduced by the number of people who use it; rather, its value increases the more it is used. Hence, making information more accessible promotes its use and can increase its value. For text documents, one of the main activities that promotes accessibility is organising the documents so that the required information is easier to find. Two main approaches to the automatic organisation of documents can be defined: classification and clustering. Classification assigns documents to pre-defined classes [8,9], while clustering can be defined as the unsupervised classification of patterns into groups called clusters [10].
Several studies in the literature address the organisation of documents in the construction sector. Most of them focus on classifying documents related to a specific area or to a specific subset of real-world data, and most of the existing studies that use clustering algorithms adopt a flat, hard clustering approach [11]. However, real-world datasets are dirty and are composed of different types of documents that refer to different areas of interest. Using an algorithm suited to classifying and/or clustering documents from a single area would therefore require an a priori filtering step to restrict the dataset to the documents of interest. Moreover, the underlying structure that characterises a set of documents can change over time because of new legal requirements, changes in performance requirements, etc. An approach that includes only static classification algorithms therefore seems to limit how the results can evolve, while clustering allows the structure to adjust as the sector evolves.
This paper presents a novel framework to optimise document usage through a hybrid approach that applies a classification algorithm and an unsupervised clustering algorithm to project and construction documents. The results of this procedure can provide the basis for applying specialised algorithms that optimise document retrieval in specific areas of interest (clusters). This objective is part of a wider framework dedicated to the optimisation of data retrieval activities, presented in Section 2.4. As a result, a system framework to classify and cluster project and construction documents was developed, together with a prototype of the classification and clustering engine. Experiments were conducted to validate the results and assess the quality of the proposed framework using real project and construction documents provided by construction companies. Real-world datasets contain different types of documents, e.g. text documents, drawings, and images. The proposed work focuses on text documents, while the application of classification and clustering to other types of documents is left for future development.
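The rest of the paper is organised as follows. Section 2 reviews the background on the classification and clustering of text data, on natural language processing and on their application in the construction sector, and states the motivation and aims of the research. Section 3 describes the framework and methodology. Section 4 covers system development and implementation. Section 5 discusses the proposed system and Section 6 draws the conclusions.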

2.            Background


2.1.        Classification and clustering of text data

Classification and clustering are two main functions of text mining processes [7]. Text mining can be defined as the process of synthesising information by analysing relations, patterns and rules in textual data [12]. It differs from what is commonly understood as search in text data: search engines focus on finding something that is already known, whereas the goal of text mining is to discover unknown information [13].
Classification (or categorisation) identifies the main themes of a document by placing it into a pre-defined set of topics (categories) [7,14]. The goal is thus to assign a set of documents to a pre-defined and fixed number of categories. In some applications, the same document can belong to more than one category. Methods of text classification include decision trees, k-nearest neighbour, Bayesian approaches, neural networks, regression-based methods and vector-based methods [13], and the dominant approach is based on machine learning techniques [14]. Classification algorithms can also be developed according to different constraints. Despite the variety of applications and methods, text classification shows some constant features: the need to handle and organise large quantities of documents in which the textual component is either the only component or the simplest one to interpret, and the fact that the set of categories is known in advance and varies little over time [13].
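As an illustration of the Bayesian approaches mentioned above, a minimal sketch of a multinomial naive Bayes text classifier with add-one smoothing is given below. The training documents, the category names ("structural", "cost") and the class name NaiveBayesTextClassifier are hypothetical and are not taken from the proposed system; the sketch only shows how documents are assigned to a fixed, pre-defined set of categories.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


class NaiveBayesTextClassifier:
    """Multinomial naive Bayes with add-one (Laplace) smoothing."""

    def fit(self, documents, labels):
        # Count documents per class and term occurrences per class.
        self.class_doc_counts = Counter(labels)
        self.term_counts = defaultdict(Counter)
        for text, label in zip(documents, labels):
            self.term_counts[label].update(tokenize(text))
        self.vocabulary = {
            term for counts in self.term_counts.values() for term in counts
        }
        self.total_docs = len(labels)
        return self

    def predict(self, text):
        # Terms never seen in training are ignored.
        tokens = [t for t in tokenize(text) if t in self.vocabulary]
        best_label, best_score = None, -math.inf
        for label, doc_count in self.class_doc_counts.items():
            counts = self.term_counts[label]
            denominator = sum(counts.values()) + len(self.vocabulary)
            # Log prior plus the sum of smoothed log likelihoods.
            score = math.log(doc_count / self.total_docs)
            for token in tokens:
                score += math.log((counts[token] + 1) / denominator)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Hypothetical training documents and categories, for illustration only.
TRAINING = [
    ("concrete beam reinforcement load calculation", "structural"),
    ("steel column load bearing analysis", "structural"),
    ("unit price bill of quantities cost estimate", "cost"),
    ("payment schedule invoice cost summary", "cost"),
]

classifier = NaiveBayesTextClassifier().fit(
    [text for text, _ in TRAINING], [label for _, label in TRAINING]
)
print(classifier.predict("load analysis of a reinforced concrete column"))  # structural
print(classifier.predict("cost estimate and invoice"))  # cost
```

The set of categories is fixed when fit is called, which reflects the constraint noted above: a new category requires new labelled examples and retraining.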
Clustering is a technique that automatically organises a dataset containing a substantial number of data objects into a smaller set of coherent groups [10,15,16]. It can be used to group similar documents [17]. Clustering differs from classification because the documents are grouped dynamically instead of being assigned to pre-defined topics and categories [7]. One of the main objectives of clustering is to give structure to a large dataset by organising similar data together to facilitate search and retrieval tasks [11]. Clustering techniques are used in several disciplines, including biology, psychiatry, psychology, archaeology, geology, geography, and marketing [18], with specific applications in pattern recognition [19], image processing [20] and information retrieval [21,22].
All clustering algorithms are based on similarity measures [7]. Clustering methods can be divided into two main classes: hierarchical clustering and flat clustering [23]. Hierarchical clustering can be further divided into two classes: agglomerative (a bottom-up approach where each element starts as a single cluster and pairs of clusters are merged as one moves up the hierarchy) and divisive (a top-down approach where all the elements start in one cluster, which is then split recursively as one moves down the hierarchy) [11]. The final result of hierarchical clustering is a tree with all the elements grouped together at one extreme and a set of clusters containing a single element each at the other. The intervening nodes contain several elements, and contain more or fewer documents as one moves up or down the hierarchy (or the tree) [7]. Hierarchical clustering is often identified as the better-quality clustering approach; however, its application can be limited by its quadratic time complexity [24]. Moreover, in document clustering applications, the inclusion of information referring to different disciplines and/or areas of interest can produce poor clustering results in exclusive clustering processes (i.e. processes where each document is assigned to exactly one cluster). Hence, documents can first be classified into macro-classes and then clustered within the main area identified in the first step.
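The agglomerative (bottom-up) procedure described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation used in the prototype: documents are represented as plain term-frequency vectors, similarity is the cosine of the angle between vectors, clusters are compared by average linkage, and the sample documents are hypothetical.

```python
import math
import re
from collections import Counter


def term_vector(text):
    """Represent a document as a bag-of-words term-frequency vector."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse term vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def agglomerative_clustering(texts, n_clusters):
    """Bottom-up clustering with average linkage.

    Starts from one cluster per document and repeatedly merges the two
    most similar clusters until n_clusters remain. Returns lists of
    document indices.
    """
    n_clusters = max(1, n_clusters)
    vectors = [term_vector(t) for t in texts]
    clusters = [[i] for i in range(len(texts))]

    def average_linkage(c1, c2):
        sims = [cosine_similarity(vectors[i], vectors[j]) for i in c1 for j in c2]
        return sum(sims) / len(sims)

    while len(clusters) > n_clusters:
        best = None
        for x in range(len(clusters)):
            for y in range(x + 1, len(clusters)):
                sim = average_linkage(clusters[x], clusters[y])
                if best is None or sim > best[0]:
                    best = (sim, x, y)
        _, x, y = best
        # Merge cluster y into cluster x and drop y.
        clusters[x] = clusters[x] + clusters[y]
        del clusters[y]
    return [sorted(c) for c in clusters]


# Hypothetical documents, for illustration only.
DOCUMENTS = [
    "concrete beam reinforcement drawing",
    "concrete column reinforcement drawing",
    "unit price cost estimate",
    "cost estimate and payment schedule",
]

print(agglomerative_clustering(DOCUMENTS, 2))  # [[0, 1], [2, 3]]
```

Stopping the merges at a given number of clusters corresponds to cutting the tree at one level. This naive version recomputes all pairwise cluster similarities at every merge and is therefore slower than optimised implementations; it is meant only to make the merge order explicit.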
Flat clustering techniques include k-means and single pass clustering.
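As an example of a flat technique, single-pass clustering can be sketched as follows. The sketch assumes term-frequency vectors, cosine similarity and a similarity threshold of 0.3; the threshold value and the sample documents are hypothetical and chosen only for illustration.

```python
import math
import re
from collections import Counter


def term_vector(text):
    """Represent a document as a bag-of-words term-frequency vector."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse term vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def single_pass_clustering(texts, threshold=0.3):
    """Process the documents once, in order.

    Each document joins the most similar existing cluster if the
    similarity to that cluster's centroid reaches the threshold;
    otherwise it opens a new cluster.
    """
    centroids = []  # summed term vectors, one per cluster
    clusters = []   # document indices, one list per cluster
    for index, text in enumerate(texts):
        vector = term_vector(text)
        best_cluster, best_sim = None, threshold
        for c, centroid in enumerate(centroids):
            sim = cosine_similarity(vector, centroid)
            if sim >= best_sim:
                best_cluster, best_sim = c, sim
        if best_cluster is None:
            centroids.append(Counter(vector))
            clusters.append([index])
        else:
            # Cosine similarity ignores scale, so a summed vector
            # behaves like the mean vector of the cluster.
            centroids[best_cluster].update(vector)
            clusters[best_cluster].append(index)
    return clusters


# Hypothetical documents, for illustration only.
DOCUMENTS = [
    "concrete beam reinforcement drawing",
    "unit price cost estimate",
    "concrete column reinforcement drawing",
    "cost estimate and payment schedule",
]

print(single_pass_clustering(DOCUMENTS))  # [[0, 2], [1, 3]]
```

The result is a flat, exclusive partition: each document belongs to exactly one cluster and no hierarchy is produced. The outcome depends on the order in which documents are processed and on the chosen threshold.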
K-means clustering algorithm [25]
Word relativity-based clustering (WRBC) method [26]
Common clustering algorithms:

  • hierarchical,
  • binary relational,

Different technique to determine document similarity from

2.2.        Natural language processing


2.3.        Classification and clustering in the construction sector

A construction process generates many documents that refer to a variety of subjects. These documents include drawings, requirements and specifications, prices, schedules, technical product specifications and many more. The picture is further complicated in large projects, where document management represents a critical aspect of the entire process.
Given this peculiarity of the construction process, several studies have been developed in this research area.
A plain and exclusive clustering procedure can limit the approach, because construction documents can contain information that belongs to several different classes (and/or clusters).
Caldas and Soibelman [5,27] proposed an automatic hierarchical classification of construction project documents to improve information organisation and access.

2.4.        Motivation and aims of the research

The literature review reveals that most existing experiments are based on supervised learning techniques, which require well-defined knowledge of the analysed dataset and a specific definition of the requirements during the development phase. Furthermore, much of the research proposed in the literature is based on a single source of documents (a single database), usually homogeneous with respect to the requirements of the proposed classification and/or document search algorithm. Unfortunately, real datasets are composed of a mix of non-homogeneous documents, and applying solutions designed for homogeneous document sets can produce a drastic loss in the efficiency and reliability of the algorithm. Hence, the proposed research tries to solve this problem by identifying a preliminary activity that organises a complex and heterogeneous dataset into a set of clusters on which individual specific applications can run effectively. This approach is well known in the field of data mining, where it is called segmentation [10]. In large databases, clustering methods can be used to segment the database into homogeneous groups so that other applications can work on clusters instead of the entire database, improving the performance of the final applications. Figure 1 represents the logical schema behind this work. Starting from the different projects developed by an organisation, the proposed approach clusters the documents produced during those projects. These clusters can then be used by specialised applications to perform tasks such as document retrieval or other document analysis, improving the efficiency of the specific application.

Figure 1 – Representation of the proposed logical schema
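The segmentation idea can be illustrated with a minimal sketch in which a retrieval task is routed to a single cluster instead of the whole collection. The cluster assignment is assumed to have been produced beforehand by a background clustering step; the function name route_and_search, the sample documents and the clusters are hypothetical.

```python
import math
import re
from collections import Counter


def term_vector(text):
    """Represent a document as a bag-of-words term-frequency vector."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse term vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def route_and_search(query, clusters, documents):
    """Pick the cluster whose centroid best matches the query, then
    rank only the documents of that cluster by similarity to the query."""
    query_vector = term_vector(query)

    def centroid(doc_ids):
        total = Counter()
        for i in doc_ids:
            total.update(term_vector(documents[i]))
        return total

    target = max(
        clusters, key=lambda ids: cosine_similarity(query_vector, centroid(ids))
    )
    return sorted(
        target,
        key=lambda i: cosine_similarity(query_vector, term_vector(documents[i])),
        reverse=True,
    )


# Hypothetical documents and a hypothetical pre-computed segmentation.
DOCUMENTS = [
    "concrete beam reinforcement drawing",
    "concrete column reinforcement drawing",
    "unit price cost estimate",
    "cost estimate and payment schedule",
]
CLUSTERS = [[0, 1], [2, 3]]

print(route_and_search("payment schedule", CLUSTERS, DOCUMENTS))  # [3, 2]
```

Only the documents of the selected cluster are compared with the query, which is the source of the efficiency gain expected from working on clusters rather than on the entire database.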

The proposed approach includes the use of hierarchical clustering algorithms. However, hierarchical clustering requires considerable computational resources and, consequently, time. This can hinder the usability of the system when it is used as a direct application, where the time between the user request and the system's answer needs to be as short as possible. Nevertheless, applications that work on background structures can tolerate slower procedures and still provide a usable final product. This is exactly the case in this study: while the proposed solution clusters the documents in the background, the specific applications developed to work on individual clusters can provide the efficiency required by users.

3.            Framework and methodology


3.1.        System framework



3.2.        Clustering application

According to Jain and Dubes [18], clustering activities involve five main steps, namely pattern representation, definition of a pattern proximity measure appropriate to the data domain, clustering or grouping, data abstraction and assessment of output.
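The first two steps listed above, pattern representation and the definition of a proximity measure, can be sketched as follows. TF-IDF weighting and cosine distance are assumptions chosen for illustration; the representation and the measure adopted in the prototype are not stated here, and the sample documents are hypothetical.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def tfidf_vectors(texts):
    """Step 1, pattern representation: one TF-IDF vector per document.

    Weight = term count * log(number of documents / document frequency),
    so a term that occurs in every document gets weight zero.
    """
    token_lists = [tokenize(t) for t in texts]
    doc_freq = Counter(term for tokens in token_lists for term in set(tokens))
    n_docs = len(texts)
    vectors = []
    for tokens in token_lists:
        counts = Counter(tokens)
        vectors.append(
            {term: count * math.log(n_docs / doc_freq[term])
             for term, count in counts.items()}
        )
    return vectors


def cosine_distance(a, b):
    """Step 2, proximity measure: 1 - cosine similarity of two sparse vectors."""
    dot = sum(weight * b[term] for term, weight in a.items() if term in b)
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 1.0
    return 1.0 - dot / (norm_a * norm_b)


def distance_matrix(vectors):
    """Pairwise distances, the usual input to the grouping step."""
    return [[cosine_distance(a, b) for b in vectors] for a in vectors]


# Hypothetical documents, for illustration only.
DOCUMENTS = [
    "concrete beam reinforcement drawing",
    "concrete column reinforcement drawing",
    "unit price cost estimate",
]

matrix = distance_matrix(tfidf_vectors(DOCUMENTS))
for row in matrix:
    print([round(value, 2) for value in row])
```

The resulting matrix is the input to the third step (clustering or grouping); documents 0 and 1, which share most of their terms, are closer to each other than either is to document 2.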
The importance of domain knowledge is also clear in this context. All clustering algorithms will produce clusters for any type of dataset, regardless of whether the data contain clusters or not [10]. Hence, a preliminary analysis of the dataset is critical to understand its structure, choose the best algorithm and, consequently, assess the quality of the results.

4.            System development and implementation

The prototype was developed in Python, which offers several advantages.
Note about NLTK package:
Note about clustering packages
SciPy includes hierarchical clustering applications.

5.            Discussion

Language limits. The proposed system has been developed for a specific language. Construction work is still bound to national languages because of the need to communicate with local industries and public administrations. Hence, future development of the system could look at the integration of multiple languages and of conversion strategies. However, these aspects raise issues that are not limited to specific language conversion (or dictionary) applications. In fact, technical terms can carry hidden meanings that are difficult to explain and/or to include in a global and comprehensive application.

6.            Conclusions


7.            References

[1]       D. Moody, P. Walsh, Measuring The Value Of Information: An Asset Valuation Approach, in: Seventh Eur. Conf. Inf. Syst., Frederiksberg, Denmark, 1999: pp. 1–17.
[2]       L. Soibelman, J. Wu, C. Caldas, I. Brilakis, K.Y. Lin, Management and analysis of unstructured construction data types, Adv. Eng. Informatics. 22 (2008) 15–27. doi:10.1016/j.aei.2007.08.011.
[3]       A.J.P. Tixier, M.R. Hallowell, B. Rajagopalan, D. Bowman, Automated content analysis for construction safety: A natural language processing system to extract precursors and outcomes from unstructured injury reports, Autom. Constr. 62 (2016) 45–56. doi:10.1016/j.autcon.2015.11.001.
[4]       Y. Zou, A. Kiviniemi, S.W. Jones, Retrieving similar cases for construction project risk management using Natural Language Processing techniques, Autom. Constr. 80 (2017) 66–76. doi:10.1016/j.autcon.2017.04.003.
[5]       C.H. Caldas, L. Soibelman, Automating hierarchical document classification for construction management information systems, Autom. Constr. 12 (2003) 395–406. doi:10.1016/S0926-5805(03)00004-9.
[6]       L. Soibelman, C.H. Caldas, Project extranets for construction management: the American experience, in: Proc. ENTAC-2000, Salvador, Brazil, 2000.
[7]       V. Gupta, G.S. Lehal, A Survey of Text Mining Techniques and Applications, J. Emerg. Technol. Web Intell. 1 (2009) 60–76. http://www.academypublisher.com/jetwi/vol01/no1/jetwi01016076.pdf.
[8]       B. Liu, Web data mining: Exploring hyperlinks, contents, and usage data (data-centric systems and applications), Springer Verlag, New York, NY, USA, 2006.
[9]       C.D. Manning, P. Raghavan, H. Schütze, Introduction to information retrieval, Cambridge University Press, New York, NY, USA, 2008.
[10]     A.K. Jain, M.N. Murty, P.J. Flynn, Data Clustering: A review, ACM Comput. Surv. 31 (1999) 264–323.
[11]     M. Al Qady, A. Kandil, Automatic clustering of construction project documents based on textual similarity, Autom. Constr. 42 (2014) 36–49. doi:10.1016/j.autcon.2014.02.006.
[12]     W. Berry Michael, Automatic Discovery of Similar Words, in: Surv. Text Min. Clust. Classif. Retr., Springer-Verlag, New York, 2004: pp. 24–43.
[13]     S. Niharika, V.S. Latha, D.R. Lavanya, A survey on text categorization, Int. J. Comput. Trends Technol. 3 (2012) 39–45. http://www.ijcttjournal.org/Volume3/issue-1/IJCTT-V3I1P108.pdf.
[14]     F. Sebastiani, Machine learning in automated text categorization, ACM Comput. Surv. 34 (2002) 1–47.
[15]     P. Willett, Recent trends in hierarchical document clustering: A critical review, Inf. Process. Manag. 24 (1988) 577–597.
[16]     A. Huang, Similarity measures for text document clustering, in: Proc. New Zeal. Comput. Sci. Res. Student Conf. 2008, 2008: pp. 49–56. http://nzcsrsc08.canterbury.ac.nz/site/proceedings/Individual_Papers/pg049_Similarity_Measures_for_Text_Document_Clustering.pdf.
[17]     S. Iiritano, M. Ruffolo, Managing the knowledge contained in electronic documents: a clustering method for text mining, in: 12th Int. Work. Database Expert Syst. Appl., IEEE Comput. Soc, Munich, Germany, 2001: pp. 454–458. doi:10.1109/DEXA.2001.953103.
[18]     A.K. Jain, R.C. Dubes, Algorithms for Clustering Data, Prentice Hall, Englewood Cliffs, NJ, 1988.
[19]     M.R. Anderberg, Clustering Analysis for Applications, Academic Press, New York, 1973.
[20]     A.K. Jain, P.J. Flynn, Image Segmentation Using Clustering, in: N. Ahuja, K. Bowyer (Eds.), Adv. Image Underst. A Festschrift Azriel Rosenfeld, IEEE Computer Society Press, 1996: pp. 65–83.
[21]     E. Rasmussen, Clustering Algorithms, in: W.B. Frakes, R. Baeza-Yates (Eds.), Inf. Retr. Data Struct. Algorithms, Prentice Hall, Englewood Cliffs, NJ, 1992: pp. 419–442.
[22]     G. Salton, Developments in Automatic Text Retrieval, Science 253 (1991) 974–980.
[23]     W.B. Frakes, R. Baeza-Yates, Information Retrieval: Data Structures and Algorithms, Prentice Hall, 1992.
[24]     M. Steinbach, G. Karypis, V. Kumar, A Comparison of Document Clustering Techniques, KDD Work. Text Min. (2000) 1–2. doi:10.1109/ICCCYB.2008.4721382.
[25]     M. Zhao, J. Wang, G. Fan, Research on Application of Improved Text Cluster Algorithm in Intelligent QA System, in: 2008 Second Int. Conf. Genet. Evol. Comput., IEEE, Hubei, China, 2008: pp. 463–466. doi:10.1109/WGEC.2008.49.
[26]     X. Yang, D. Guo, X. Cao, J. Zhou, Research on Ontology-Based Text Clustering, in: 2008 Third Int. Work. Semant. Media Adapt. Pers., IEEE, Prague, Czech Republic, 2008: pp. 141–146. doi:10.1109/SMAP.2008.14.
[27]     C.H. Caldas, L. Soibelman, J. Han, Automated Classification of Construction Project Documents, J. Comput. Civ. Eng. 16 (2002) 234–243. doi:10.1061/(ASCE)0887-3801(2002)16:4(234).