Document classification requires the classes to be defined a priori. Thus, a preliminary analysis is required of the dataset, of the context in which it was developed, and of who could use it in the future and for what purpose. The dataset used in the experimentation is composed of project documents from public tenders and of bid documents from the construction companies that participated in the tendering procedure. The documents were collected in the Italian context; hence, a first understanding of the documents that can be found is provided by the Italian law on public contracts (Decree of the President of the Republic of October 5, 2010, n. 207), which establishes the minimum set of documents for a tendered project at its different stages. In summary, the main documents are listed below:

  • General report of the work.
  • Technical reports and specialist reports.
  • City planning documents for the work.
  • Environmental impact study.
  • Calculations about structures and plants.
  • Description and performances of technical elements.
  • Design of interferences.
  • Price description and list.
  • Estimative metric computation.
  • General economic framework.
  • Contractual schema.
  • Tender specification.
  • Safety and security plan.

On the other hand, an analysis of the organizational structure of some engineering companies makes it possible to define a more general subdivision into which the above-mentioned documents can be placed. Engineering and construction companies are usually based on a project-oriented organization (Shirazi, Langford, & Rowlinson, 1996; Ilin et al., 2016). Moreover, depending on the size of the company, several departments can be identified, each related to a specific area of activity, e.g. infrastructure, buildings, and energy. The project-oriented nature intersects with the main sectors of the organization, creating a matrix structure. Because this organizational structure is highly variable, it is difficult to define a precise and detailed subdivision. Nevertheless, it is possible to identify some general areas, listed below:

  • Administration: this department handles all the bureaucracy related to the project. In some cases the same department also manages quality, while in other structures, depending on the size of the company, there is a dedicated quality department.
  • Security and Safety: this department includes all the aspects of security and safety.
  • Design: this department covers the aspects related to architectural design and technology (in the latter area it shares many components with the engineering department).
  • Engineering: this department includes all the engineering aspects, e.g. structures, HVAC, electrical, and fire safety.
  • Economic and management: this department deals with economic and scheduling matters.
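
As an illustration, the five general areas can be encoded as a fixed set of top-level labels under which the tender documents listed earlier are filed. The following Python sketch is not part of the original study: the names `Area`, `DOCUMENT_AREA` and `documents_in` are hypothetical, and the assignment of each document type to an area is an illustrative assumption, since a real company may file the same document differently.

```python
from enum import Enum


class Area(Enum):
    """The five general areas identified above (top-level classes)."""
    ADMINISTRATION = "Administration"
    SECURITY_AND_SAFETY = "Security and Safety"
    DESIGN = "Design"
    ENGINEERING = "Engineering"
    ECONOMIC_AND_MANAGEMENT = "Economic and management"


# ASSUMPTION: this assignment is illustrative only; the source states that the
# tender documents can be included in the general areas but gives no mapping.
DOCUMENT_AREA = {
    "Contractual schema": Area.ADMINISTRATION,
    "Tender specification": Area.ADMINISTRATION,
    "Safety and security plan": Area.SECURITY_AND_SAFETY,
    "City planning documents for the work": Area.DESIGN,
    "Technical reports and specialist reports": Area.ENGINEERING,
    "Calculations about structures and plants": Area.ENGINEERING,
    "Price description and list": Area.ECONOMIC_AND_MANAGEMENT,
    "General economic framework": Area.ECONOMIC_AND_MANAGEMENT,
}


def documents_in(area):
    """Return the document types filed under the given area, sorted by name."""
    return sorted(doc for doc, a in DOCUMENT_AREA.items() if a is area)


if __name__ == "__main__":
    for area in Area:
        print(area.value, "->", documents_in(area))
```

Keeping the top-level areas as a closed enumeration reflects the a priori nature of the classification step, while the mapping itself remains easy to adjust for a specific organization.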

In the literature, Caldas, Soibelman and Han (2002) proposed 13 classes for the classification of meeting minutes related to a construction project. The classes were defined by studying the structure of the document set and are listed below:

  • General
  • Schedule
  • Demolition-Civil
  • Landscape-site
  • Structures
  • Building_skin
  • Roofing_waterproof
  • Interior finishes
  • Conveyance
  • Plumbing
  • Fire protection
  • HVAC
  • Electrical

This classification includes some elements that are not present in the dataset considered in this study. In fact, while the classes listed above were related to the construction phase, our dataset is related to the project and tendering phase. However, de facto all the elements found when classifying documents from the construction phase can be included in the more general classification proposed here.
Usually, in engineering and construction organizations, the engineering department does not exist as a single unit but is divided by specialization (e.g. structures, HVAC, electrical, plumbing, fire protection). This is also consistent with the classification proposed by Caldas, Soibelman and Han (2002). However, identifying the specific areas a priori can constrain the algorithm by imposing a fixed set of disciplines, whereas the evolution of the construction sector can introduce new specialization areas (e.g. internet connectivity for the Internet of Things, or 4G and 5G installations in buildings). Hence, this subdivision can be handled by a clustering algorithm, which allows the identified clusters to evolve as the dataset evolves.
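
The idea of letting the engineering sub-areas emerge from the data can be illustrated with a small incremental clustering sketch. The study does not prescribe the algorithm shown here: we assume, purely for illustration, a leader-style clustering over bag-of-words vectors with cosine similarity, and the class name `LeaderClusterer`, the threshold of 0.3 and the sample document texts are all hypothetical. A document that is not similar enough to any existing cluster opens a new one, so a new specialization (here a 5G installation) does not have to be foreseen in advance.

```python
import math
from collections import Counter


def bag_of_words(text):
    """Lowercased token counts; a stand-in for a real text representation."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class LeaderClusterer:
    """Hypothetical incremental clusterer: the number of clusters is not fixed."""

    def __init__(self, threshold=0.3):
        self.threshold = threshold  # ASSUMPTION: arbitrary illustrative value
        self.centroids = []  # running sum of member vectors, one per cluster

    def add(self, text):
        """Assign a document to the most similar cluster or open a new one."""
        vec = bag_of_words(text)
        best, best_sim = None, 0.0
        for i, centroid in enumerate(self.centroids):
            sim = cosine(vec, centroid)  # cosine ignores the centroid's scale
            if sim > best_sim:
                best, best_sim = i, sim
        if best is not None and best_sim >= self.threshold:
            self.centroids[best].update(vec)  # fold the document into the cluster
            return best
        self.centroids.append(Counter(vec))  # unseen topic: new cluster
        return len(self.centroids) - 1


# ASSUMPTION: toy document texts, not taken from the dataset of the study.
docs = [
    "structural calculation of reinforced concrete beams",
    "structural calculation of steel columns",
    "hvac duct sizing and air handling units",
    "5g antenna installation inside the building",
    "hvac air handling units maintenance",
]

clusterer = LeaderClusterer(threshold=0.3)
labels = [clusterer.add(d) for d in docs]
print(labels)
```

In this toy run the two structural documents share a cluster, the two HVAC documents share another, and the 5G document forms a cluster of its own. A production system would likely replace the raw token counts with a weighted representation and tune the threshold on real data.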