Data Import and Preparation
The initial step in any data mining process is to make sure that the data is prepared in a form that allows analysis and mining. In this project, some variables are empty and have no values at all. The first step was therefore to filter out these columns so that only columns containing data remain. Below is a screenshot of how the data looked after filtering out the empty columns using the KNIME data mining tool.
KNIME can compute both basic and more complex statistics for the variables. The following are some of the statistics that can be reported from the dataset.
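The column filtering step can be illustrated outside KNIME with a minimal Python sketch. It is not the KNIME node itself: the helper name drop_empty_columns, the toy records, and the column names employer_name and unused_field are assumptions made for illustration, and a value is treated as missing when it is None or a blank string.

```python
def drop_empty_columns(rows):
    """Keep only the columns that hold at least one non-missing value."""
    def is_missing(value):
        # Assumption: None and blank strings both count as missing.
        return value is None or (isinstance(value, str) and value.strip() == "")

    # Collect column names in first-seen order.
    columns = []
    for row in rows:
        for name in row:
            if name not in columns:
                columns.append(name)

    kept = [c for c in columns if any(not is_missing(r.get(c)) for r in rows)]
    return [{c: r.get(c) for c in kept} for r in rows]


# Hypothetical toy records; "unused_field" is empty in every row.
rows = [
    {"employer_name": "A", "employer_num_employees": 3, "unused_field": None},
    {"employer_name": "B", "employer_num_employees": 181000, "unused_field": ""},
]
filtered = drop_empty_columns(rows)
print(list(filtered[0]))
```

Only columns that are empty in every row are removed; a column with some missing values is kept.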
It is important in such a project to determine the relationship between the variables. Below is a screenshot of the correlation matrix and correlation table of the data set variables.
Figure 1 Correlation Matrix
From this analysis it is possible to note that there is a strong relationship between the data variables, based on the correlation table and matrix above. This is an indication that the employee data in the data set are closely related. The values also do not differ by a big margin: the correlation between the minimum values and the maximum values is 0.811. The correlation among the means of the variables is 0.977, which is close to 1. This value indicates a near-perfect uphill, that is, a strong positive linear relationship between the distributions of the variables. A sample scatter plot generated from this data is shown in Figure 2.
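The correlation values discussed above are Pearson coefficients, where a value near 1 means a strong positive (uphill) linear relationship. The following is a minimal sketch of how such a matrix can be computed, assuming numeric columns of equal length. The function names and the toy columns x, y and z are illustrative and do not come from the dataset.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric lists."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two lists of equal length with at least 2 values")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant column")
    return cov / math.sqrt(var_x * var_y)


def correlation_matrix(columns):
    """Pairwise Pearson correlations for a dict of column name -> values."""
    names = list(columns)
    return {a: {b: pearson(columns[a], columns[b]) for b in names} for a in names}


# Toy columns (not taken from the dataset): y rises with x, z falls with x.
columns = {"x": [1, 2, 3, 4], "y": [2, 4, 6, 8], "z": [4, 3, 2, 1]}
matrix = correlation_matrix(columns)
print(matrix["x"]["y"], matrix["x"]["z"])
```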
General Descriptive Statistics
Below is the general statistical table from the analysis.
Some of the variables that can be analyzed from this table include:
Employers’ Number of Employees.
From the data analysis using the KNIME data mining software, the minimum number of employees recorded was 3, while the maximum recorded in this period was 181,000. These figures are for the first 50 samples out of a whole population of more than 20,000 samples.
Drilling down into this variable, there is a lot of information that can be generated. From the summary, there was a total of 895,000 employees working in the listed companies. There are no missing values in this set of figures since every value was reported. The average number of employees per company is 17,900. The pie chart below shows the distribution of the standard deviation values for the number of employees in the listed companies:
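Summary values of this kind (count, missing values, minimum, maximum, total and mean) can be reproduced with a few lines of Python. This is a minimal sketch, assuming missing values are stored as None. The helper name summarize and the toy list are illustrative, with only the extremes 3 and 181,000 borrowed from the figures above.

```python
import statistics


def summarize(values):
    """Basic descriptive statistics, ignoring missing (None) entries."""
    present = [v for v in values if v is not None]
    return {
        "count": len(present),
        "missing": len(values) - len(present),
        "min": min(present),
        "max": max(present),
        "sum": sum(present),
        "mean": statistics.fmean(present),
    }


# Toy values; only the extremes 3 and 181000 echo the reported figures.
summary = summarize([3, 250, None, 181000])
print(summary)
```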
Figure 2 Correlation Scatter Plot
The scatter plot shows that the samples are highly concentrated towards 1.0, which indicates a positive uphill and thus a strong relationship among the variables. This reflects the concentration of employees in the companies listed.
Figure 3 Case Status vs employee number histogram
The histogram also indicates the relationship between the case status and the number of employees in the companies. From this it can be deduced that there is a steady trend in the number of employees across the statuses, which indicates a fairly uniform distribution of employees over the employment case statuses. From the case status histogram, the highest number of employees falls in the certified group. This is a confirmation that, for one to be hired, he or she has to be certified, and that certification is one of the highest-ranking factors in the job industry. The next group is the expired certifications, which also has a comparatively high number of employees. The conclusion from the expired certifications could be that most employees who were hired have not renewed or progressed in their certification journey. Employees who have withdrawn from employment follow, with a significantly lower number. This is an indication that the companies base their selection criteria mainly on academic qualifications when hiring. For a senior member of a company, this is meaningful information on which to base hiring judgments. Skills are given less weight in this case, and the candidate's level of qualification is what is most likely to secure him or her a job.
This information is also very useful to management when deciding between alternatives before hiring. The number of employees in the denied group is very low, an indication that those who are less qualified rarely attempt to enter the job market. However, the steady distribution in the number of employees indicates a balance among the population based on experience, qualification, and retirement or withdrawal.
Comparing the number of employees in the companies against the year in which each company was established, slight variations can also be noted in the data collected. This is common in the management of any company, where employees tend to shift to newly formed companies due to factors such as compensation rates and the value placed on their experience. The hierarchical clustering below, extracted for the analysis, shows the distribution and variation in the number of employees in the companies. As a manager, it is important to review employees' terms regularly in order to keep the retention level high.
Figure 4 Hierarchical Clustering for employee numbers vs the year of establishment of the company
Number of Employees vs Year of Establishment Statistics
Comparing the number of employees and the year of establishment of the companies, the results were as shown in the tables below.
Figure 5: Employer_num_employees Histogram
While analyzing the data from the employee database, it is important to look into statistical measures such as skewness in order to come to a conclusion. The skewness of the data describes whether the distribution leans to the left or to the right, and therefore the extent to which the distribution of a variable differs from the normal distribution. For a normal distribution there has to be symmetry in the histogram. In this case, the number of employees is skewed to the right. The skewness value was 3.1544. Because a symmetric distribution has a skewness of 0, a positive value of this size points to a long right tail rather than to a distribution close to normal, which is consistent with a maximum of 181,000 employees lying far above the mean of 17,900. The histogram also shows this distribution. The first column differs in height, but the second and following columns show some symmetry in their heights, an indication of a stable relationship between the values. It is possible to confirm that there was little variation in the number of employees with the age of the companies, although the number kept reducing with time. As the years of establishment increased, the number of employees in the companies reduced steadily.
The comparison between the two variables shows a kind of inverse proportionality in their growth. This is an indication to management that there is likely to be a shift in the number of employees over a company's time of existence. The main idea that management should take from this is to find a way of retaining employees lest they lose them to other companies. The skewness gives a meaningful comparison of the data values: instead of management having to go around asking what could be affecting performance, it is possible to look at the skewness of the samples and confirm what the issue on the ground is. The number of employees in a company helps in determining the performance of the employees. A reduction in number and shifting of staff can cause a lot of issues.
Among the issues that would be raised from such a summary are the factors that would lead to such a change. From a basic view and understanding, working conditions could either attract employees to the company or cause a migration to a competitor. This is normal in partner companies and in any other company setup in the world. People tend to move based on the standards and level of upkeep they earn from their jobs. One would not expect a company that has been running for many years to be overtaken by small companies, but this is happening in the real world because new companies are coming up with new compensation strategies for their employees.
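Skewness can be checked directly from the values. The sketch below uses the population (Fisher-Pearson) coefficient, the third central moment divided by the second central moment raised to the power 1.5. It is an assumption that this definition is close to the one reported by the statistics tool, which may apply a bias correction, and the two toy lists are illustrative only.

```python
def skewness(values):
    """Population (Fisher-Pearson) skewness: m3 / m2 ** 1.5."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n  # second central moment
    m3 = sum((v - mean) ** 3 for v in values) / n  # third central moment
    if m2 == 0:
        raise ValueError("skewness is undefined for constant data")
    return m3 / m2 ** 1.5


symmetric = skewness([1, 2, 3, 4, 5])       # balanced around the mean
right_tailed = skewness([1, 1, 1, 1, 10])   # one large value pulls the tail right
print(symmetric, right_tailed)
```

A value of 0 corresponds to a symmetric histogram, and the further the value rises above 0, the longer the right tail.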
In data mining, data processing is among the major goals of the project. From data processing it is possible to derive conclusions based on the data collected and prepared in the data mining tools. Among the most common data processing capabilities in data mining is binning. Binning helps to reduce the effect of minor errors in the data in order to get the expected results. Data binning is close to, or similar to, quantization. In this process, the original data values that fall into a small interval are replaced by a representative value for that interval so that the data makes sense in mining.
In the dataset under analysis, the number of employees is among the most important variables and is influenced by all the other variables in the dataset. Every variable affects and touches on the number of employees. Some factors cannot be assessed where the number of employees has missing values. All the data processing activities in this section must also run on data that is well structured, having been grouped based on the variables. In this case the data was grouped by the type of certification, the employee citizenship, the employer state and some other variables that can be grouped to form a meaningful structure. The binning process relies on the output of this grouping to form a structured set of data that can be analyzed further. Below is a screenshot of part of the binned data. As can be seen, the data is compact and well organized, with no missing values, and all the variables have been arranged next to each other in a complete order. From this structure it is possible to draw conclusions and conduct analysis procedures. The columns are structured so that it is possible to identify all the variables and decide which analytic procedure to conduct on the data.
Binning is therefore an efficient way of making sure that the processing of this data can offer meaningful conclusions during analysis. The binning process on the employer_num_employees variable is evidence of how adjusting the width and depth of the bins is important in making sure that the analysis is a smooth process.
After the data had been imported and prepared in the KNIME tool, it had to undergo binning to smooth the values so that they would be effective in the analysis process. The attached Excel sheet contains the data obtained in the process of binning based on the number of employees. This is critical whenever the variable has to be analyzed against others. The binning process was successful and resulted in smooth data, as evidenced in the Excel sheet. The process is completed by mapping it onto the KNIME workflow as intended.
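Replacing values by a representative of their interval can be sketched as follows. This is only one possible variant, assuming equal-width bins and the bin mean as the representative value. The report does not state which binning settings were used in KNIME, and the function name and toy values are illustrative.

```python
def smooth_by_bin_means(values, n_bins):
    """Equal-width binning; each value is replaced by the mean of its bin."""
    lo, hi = min(values), max(values)
    if lo == hi:
        return [float(v) for v in values]  # a single bin, nothing to smooth
    width = (hi - lo) / n_bins

    def bin_index(v):
        # The maximum value is assigned to the last bin.
        return min(int((v - lo) / width), n_bins - 1)

    members = {}
    for v in values:
        members.setdefault(bin_index(v), []).append(v)
    means = {i: sum(vs) / len(vs) for i, vs in members.items()}
    return [means[bin_index(v)] for v in values]


# Toy values (not from the dataset), smoothed into three equal-width bins.
smoothed = smooth_by_bin_means([4, 8, 15, 21, 21, 24, 25, 28, 34], 3)
print(smoothed)
```

Using the bin median or the bin boundaries as the representative would follow the same structure.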
Figure 6: Data Binning
The data binning process was conducted on the dataset and a total of 122 columns were presented in the results. The table above contained a set of 50 rows of data for each variable. Sometimes it is important to reduce the number of records in the dataset in order to conduct an analysis well. The subsetting process can only result in accurate analysis if there is a positive uphill in the correlation matrix of the data. From the statistical analysis above, it was clear that almost all the variables had a close relationship and there was evidence of a positive uphill in every variable. This makes it possible to use even a small sample of the data set to conduct an analysis and come to a conclusion.
Normalization of data
In data mining, most of the techniques used involve distance computations. It is therefore important to standardize the variables in order to avoid variables with larger values dominating the model. Various normalization processes can be undertaken in data mining, such as min-max normalization, which transforms the data values into a given range. In this data mining project, there was a need to conduct several types of normalization on the data to give it a structure that could be defined within the analytic process. In this dataset, normalization was applied to the employer_num_employees variable, which represents the total number of employees in every company listed in the data.
The first normalization process was to normalize employer_num_employees using min-max normalization to transform the values into the [0.0, 1.0] range. The results were as below.
Figure 7: Range Normalizer
The employer_num_employees column can be seen to have been rescaled, with 0 as the lowest value and 1 as the highest value. This is the result of the normalization process.
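Min-max normalization maps the smallest value to the lower bound of the target range and the largest value to the upper bound, scaling everything else linearly in between. The following is a minimal sketch of the calculation; the function name, its parameters and the toy values are illustrative and are not the KNIME node's interface.

```python
def min_max_normalize(values, new_min=0.0, new_max=1.0):
    """Linearly rescale values so that min -> new_min and max -> new_max."""
    lo, hi = min(values), max(values)
    if lo == hi:
        raise ValueError("min-max normalization is undefined for constant data")
    return [new_min + (v - lo) / (hi - lo) * (new_max - new_min) for v in values]


# Toy employee counts (not from the dataset).
normalized = min_max_normalize([10, 20, 30, 50])
print(normalized)
```

Because the result depends only on the minimum and maximum, a single very large employer compresses most other companies towards 0.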
The second normalization process was to use z-score normalization to transform the values as shown below.
Figure 8: Z-Score Normalizer
Range. For employer_num_employees, the normalized values fell between 0 and 1. The highest value was 1, which was for Merrill Lynch. The lowest value, on the other hand, was 0, for Lohan Chiropractic and Acupuncture Clinic LLC. These ratios showed that a foreign worker had an equal opportunity for employment at Merrill Lynch, as shown by the ratio of 1. At Lohan Chiropractic and Acupuncture Clinic LLC, however, the foreign worker had an almost zero chance, since the value was zero. Overall, most businesses had a value of less than 0.5. This showed that opportunities were mostly skewed towards local citizens over foreigners.
Z-Score. The highest z-scores were for Merrill Lynch, A2Z Development, TRW Automotive Inc., Microsoft, and Google Inc., which had scores of 4.508, 3.7647, 1.274, 1.16, and 0.9939 respectively. The lowest scores, on the other hand, were for Matech, Zylog System, Adara Media, The University of Alabama in Huntsville, and Groupware Solution, Inc. The average z-score for these companies was -0.480. The low z-scores showed that, on average, these companies had very few foreign workers. In addition, there was a higher likelihood of a local being employed over a foreign worker in these organizations.
After normalization with z-score, the values in the number-of-employees column have been rescaled to z-score values ranging between approximately -0.49 and 4.6. This helps in understanding the distribution of the values in terms of the maximum and minimum values.
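A z-score expresses each value as its distance from the mean in units of standard deviation, so positive scores lie above the mean and negative scores below it. The sketch below assumes the population standard deviation; whether the tool used the population or the sample form is not stated in the report, and the toy values are illustrative.

```python
import statistics


def z_score_normalize(values):
    """Return (value - mean) / population standard deviation for each value."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)  # assumption: population standard deviation
    if std == 0:
        raise ValueError("z-scores are undefined for constant data")
    return [(v - mean) / std for v in values]


# Toy values with mean 5 and population standard deviation 2.
z_scores = z_score_normalize([2, 4, 4, 4, 5, 5, 7, 9])
print(z_scores)
```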
Discretization of employer_num_employees
The next process was to discretize the employer_num_employees variable into five categories: Startup=0-10; Small_Scale=11-100; Medium_Scale=101-2000; Large_Scale=2001-20000; Giant_Scale=20001+, and to provide the frequency of each category in the data set. The following were the results for every category. As can be seen in every column, the number of employees was set in the ranges of the categories as per the required criteria. The startup group was set first and the results were as below. In this category the number of employees was between 0 and 10.
Figure 9: Startup Group 1
The next category of employees fell in the small scale group with the numbers between 11 and 100 as shown below.
Figure 9: Startup Group 2
Figure 10: Medium Scale Group.
Medium scale employees had values in the range of 101 to 2000.
The large scale category had values ranging from 2001 to 20000, and finally the giant group was all values above 20000.
Figure 11: Large Scale Group
Figure 12: Giant Scale Group
The range of each category varied with the expected number of employees. Ordinarily, a large company has more workers than a small one. Therefore, the ranges increased progressively with the size of the workforce. The smallest range was 0-10 for the Startup group and the largest was the Giant group, which covered values greater than 20000.
In conclusion, the more skilled a foreign worker is, the more likely he or she is to get employment. From the analysis, it was observed that individuals with certificates have higher chances of getting work than those without. Accordingly, the certified ones will be more employable. Given that certified workers soon see their certificates expire and these documents are later withdrawn, it appears that most do not renew these permits, or do not earn enough wages to renew them. The analysis also shows that, overall, companies have more local workers than foreigners and also prefer hiring local workers. Accordingly, foreign workers must expect to have significantly fewer opportunities than local workers.
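The five categories and their frequencies can be sketched in Python as follows. The category labels and upper bounds are the ones listed in this section, while the function names and the toy employee counts are assumptions made for illustration.

```python
from collections import Counter

# Inclusive upper bounds for each category, as listed in this section.
UPPER_BOUNDS = [
    ("Startup", 10),
    ("Small_Scale", 100),
    ("Medium_Scale", 2000),
    ("Large_Scale", 20000),
]


def categorize(num_employees):
    """Map an employee count to one of the five size categories."""
    if num_employees < 0:
        raise ValueError("employee count cannot be negative")
    for label, upper in UPPER_BOUNDS:
        if num_employees <= upper:
            return label
    return "Giant_Scale"  # 20001 and above


def category_frequencies(counts):
    """Frequency of each category in a list of employee counts."""
    return Counter(categorize(n) for n in counts)


# Toy counts chosen to sit on the category boundaries.
frequencies = category_frequencies(
    [3, 10, 11, 100, 101, 2000, 2001, 20000, 20001, 181000]
)
print(frequencies)
```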