Initial Data Exploration: KNIME Tools Analysis
Data Import and Preparation
The first step in any data mining process is to make sure that the data is prepared in a way that allows for analysis and mining. In this project, some variables are empty and have no values at all. These columns were therefore filtered out so that only columns containing data remained. Below is a screenshot of how the data looked after the empty columns were filtered out using the KNIME data mining tool.
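The column filtering described above was done inside KNIME. Outside KNIME, the same idea can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the rows are represented as dictionaries, a value counts as missing when it is None or a blank string, and the column name agent_city is hypothetical and used only to show an empty column.

```python
def drop_empty_columns(rows):
    """Return the rows with every column removed that has no value in any row."""
    def is_missing(value):
        # Assumption: None and blank strings both count as "no value".
        return value is None or (isinstance(value, str) and value.strip() == "")

    columns = list(rows[0].keys()) if rows else []
    kept = [c for c in columns if any(not is_missing(r.get(c)) for r in rows)]
    return [{c: r.get(c) for c in kept} for r in rows]


# Hypothetical sample rows; "agent_city" is empty in every row.
sample = [
    {"case_status": "Certified", "employer_num_employees": 120, "agent_city": ""},
    {"case_status": "Denied", "employer_num_employees": 3, "agent_city": None},
]
cleaned = drop_empty_columns(sample)
print(list(cleaned[0].keys()))
```

A column is kept as soon as one row has a value in it, so partially filled columns survive and only fully empty ones are removed.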
Descriptive Statistics
Using KNIME, it is possible to compute both basic and more complex statistics for the variables. The following are some of the statistics that can be reported from the dataset.
It is important in such a project to determine the relationship between the variables. Below is a screenshot of the correlation matrix and correlation table of the data set variables.
Figure 1 Correlation Matrix
From this analysis, the correlation table and matrix above show a strong relationship between the variables. This indicates that the employee data in the dataset are closely related. The values also do not differ by a large margin: the correlation between the minimum and maximum values is 0.811, and the correlation among the means of the variables is 0.977, which is close to 1. These values indicate a very strong uphill (positive linear) relationship between the distributions of the variables. A sample scatter plot generated from this data is shown in Figure 3.
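For readers who want to see how a correlation value such as 0.811 or 0.977 is obtained, the following is a minimal sketch of the Pearson correlation coefficient. It assumes that the values in the correlation table are Pearson coefficients, which the report does not state explicitly. The function name pearson and the sample lists are illustrative only.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length numeric lists."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two lists of equal length with at least 2 values")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Sum of products of deviations, and the two sums of squared deviations.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant column")
    return cov / math.sqrt(var_x * var_y)


# Illustrative data only: a perfectly increasing and a perfectly decreasing pair.
uphill = pearson([1, 2, 3], [2, 4, 6])
downhill = pearson([1, 2, 3], [6, 4, 2])
print(uphill, downhill)
```

A result near +1 corresponds to the uphill relationship described above, a result near -1 to a downhill one, and a result near 0 to no linear relationship.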
The average skewness for the model is mostly negative. These values indicate that the data is skewed to the left.
General Descriptive Statistics
Below is the general statistical table from the analysis.
Some of the variables that can be analyzed from this table include:
Employers’ Number of Employees.
From the data analysis using the KNIME data mining software, the minimum number of employees recorded was 3, while the maximum recorded in this period was 181,000. These figures are for the first 50 samples out of a population of more than 20,000 samples.
Looking more closely at this variable, a lot of information can be generated. From the summary, there was a total of 895,000 employees working across the industries. There are no missing values in this set of figures, since every value was reported. The average number of employees per company was 17,900. The pie chart below shows the distribution of the standard deviation values for the number of employees in the listed companies:
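The summary figures for this variable (count, missing values, minimum, maximum, total, and mean) can also be reproduced outside KNIME. The sketch below assumes the values are available as a Python list in which None marks a missing entry. The function name summarize and the sample list are hypothetical.

```python
import statistics


def summarize(values):
    """Basic descriptive statistics, ignoring entries that are None."""
    present = [v for v in values if v is not None]
    return {
        "count": len(present),
        "missing": len(values) - len(present),
        "min": min(present),
        "max": max(present),
        "sum": sum(present),
        "mean": statistics.mean(present),
    }


# Illustrative values only, not taken from the dataset.
summary = summarize([3, None, 7, 10])
print(summary)

# Consistency check on the reported figures, assuming each of the 50 sampled
# records corresponds to one company: total employees / number of samples.
reported_mean = 895000 / 50
print(reported_mean)
```

Under that assumption the reported total and the reported average agree, since 895,000 divided by 50 is 17,900.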
Figure 2: Data Distribution
Figure 3 Correlation Scatter Plot
The scatter plot shows that the samples are highly concentrated towards 1.0, which indicates a positive uphill trend and thus a strong relationship among the variables. This reflects the concentration of employees in the companies listed.
Figure 4 Case Status vs Employee Number Histogram
The histograms indicate the relationship between the case status and the number of employees in the companies. From Figure 4, it can be deduced that there is a steady trend in the number of employees across statuses, which indicates a fairly uniform distribution of employees across the case statuses. In the case status histogram, the highest number of employees falls in the certified group. This suggests that, for a person to be employed, most emphasis is put on whether he or she is certified. The next level is the expired certifications. One possible conclusion is that most employees who were hired have not renewed or progressed in their certification journey. Employees whose certificates have been withdrawn are at the third level and have a significantly lower priority. This indicates that the companies' main selection criterion is the qualification of personnel, and whether they are allowed to perform certain activities by possessing an active certificate. For a senior member in the company's setup, this is meaningful information to guide judgment while hiring. Skills alone are given less weight in this case, and the candidate's level of qualification is more likely to earn him or her a job. This information is also useful to management when deciding the criteria to use when hiring. The number of employees in the denied group is very low, an indication that those who are less qualified rarely attempt to enter the job market. Even where these individuals apply for work, they are given the last priority.
Visually, the employer number of employees row indicates that only three categories are actually given the opportunity to work: certified, certified-expired, and withdrawn. Among the permitted employees, the certified employees are the fewest and are represented by a single bar, while the certified-expired employees are the second largest group and are represented by four bars. The largest group is the withdrawn workers, who are represented by more than fifteen bars. This indicates that employees do not usually renew their certificates, which results in these documents expiring and eventually being withdrawn. The lack of representation of the denied category indicates that all workers in the organization had certificates at one time. As such, we can conclude that employers only hire individuals who are certified (those with certificates). However, employees do not renew their certificates, which causes them to expire and eventually be withdrawn. From this information, it is also likely that most employers do not follow up to see that their employees have renewed their licenses.
Comparing the number of employees in the companies based on the year of establishment also reveals slight variations in the data collected. This is common in company management, where employees tend to move to newly formed companies due to factors such as compensation rates and reliance on their experience. The hierarchical clustering below, extracted for the analysis, shows the distribution and variation in the number of employees across the companies. As a manager, it is important to regularly review the terms of the employees in order to keep the retention level high.
Figure 5 Hierarchical Clustering for employee numbers vs the year of establishment of the company
Number of Employees vs Year of Establishment Statistics
Comparing the number of employees and the year of establishment of the companies, the results were as shown in the tables below.
While analyzing the data from the employees' database, it is important to look into statistical factors such as skewness in order to reach a conclusion. Skewness describes whether a distribution leans to the left or to the right, and the extent to which the distribution of a variable differs from the normal distribution. For a normal distribution, the histogram has to be symmetric. In this case, the skewness of the number of employees was 3.1544. The positive sign means the distribution leans to the right, and a value of this size is generally read as a pronounced right skew rather than a close approximation of the normal distribution. The histogram also shows this distribution for the continuous flow of data. The first column differs in height, but the second and following columns show some symmetry in their heights, an indication of a stable relationship between the values. It is possible to confirm that there was less variation in the number of employees with the age of the companies, although the number kept reducing with time. As the years of establishment increased, the number of employees in the companies reduced steadily.
The comparison between the two variables shows a kind of inverse proportionality in their growth. This indicates to management that the number of employees is likely to shift over a company's time of existence. The main idea that management should take from this is to find a way of retaining employees, lest they lose them to other companies. The skewness gives a meaningful comparison of the data values: instead of management having to go around asking what could be affecting performance, it is possible to look at the statistical skewness of the samples and confirm what the issue on the ground is. The number of employees in a company helps in determining the performance of the employees, and a reduction in number or a shift of employees can cause many issues.
Among the issues worth examining from such a summary are the factors that would lead to such a change. From a basic understanding, working conditions can either attract employees to the company or cause a migration to a competitor. This is normal in partner companies and in any other company setup in the world. People tend to move based on the standards and level of upkeep they earn from their jobs. A company that has been running for many years would not be expected to be overtaken by small companies, but this is happening in the real world because new companies are coming up with new compensation strategies for their employees.
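To make the sign of a skewness value concrete, the following sketch computes the moment-based (Fisher-Pearson) coefficient of skewness. This is an assumption made for illustration: KNIME may apply a sample-size correction, so its output can differ slightly from this formula. The function name skewness and the sample lists are hypothetical.

```python
def skewness(values):
    """Moment-based skewness g1 = m3 / m2**1.5 (no sample-size correction)."""
    n = len(values)
    if n < 3:
        raise ValueError("need at least 3 values")
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n  # second central moment
    m3 = sum((v - mean) ** 3 for v in values) / n  # third central moment
    if m2 == 0:
        raise ValueError("skewness is undefined for constant data")
    return m3 / m2 ** 1.5


# Illustrative data only.
symmetric = skewness([1, 2, 3, 4, 5])       # symmetric around 3
right_tail = skewness([1, 1, 1, 1, 10])     # one large value pulls the tail right
left_tail = skewness([1, 10, 10, 10, 10])   # one small value pulls the tail left
print(symmetric, right_tail, left_tail)
```

A symmetric sample gives a value of zero, a long right tail gives a positive value, and a long left tail gives a negative value.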
Data Processing
In data mining, data processing is among the major goals of the project. From data processing it is possible to derive conclusions based on the data collected and prepared in the data mining tools. Among the most common data processing techniques in data mining is binning. Binning helps reduce the effect of minor errors in the data in order to get the expected results. Data binning is similar to quantization. In this process, the original data values that fall into a small interval are replaced by a value representative of that interval, so that the data makes sense in mining.
In the dataset under analysis, the number of employees is among the most important variables and is linked to every other variable in the dataset. The existence of all the other variables depends on this variable. Binning was therefore an efficient way of making sure that the processing of this data could offer meaningful conclusions during analysis. The binning process on the employer_num_employees variable is evidence of how adjusting the width and depth of the bins is important in making sure that the analysis is a smooth process.
After the data had been imported and prepared in the KNIME tool, it underwent binning to smooth the values so that they would be effective in the analysis process. The attached Excel sheet contains the data obtained from binning on the number of employees. This is critical especially where one variable needs to be analyzed against another. The binning process was successful and resulted in smooth data, as can be seen in the Excel sheet. The process was completed by mapping it onto the KNIME workflows as intended.
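The replacement of values by an interval representative can be sketched as follows. This is a minimal illustration that assumes equal-width bins and uses the bin mean as the representative. The report does not state which binning method or representative was used in KNIME, so both choices, the function name smooth_by_bin_means, and the sample values are assumptions.

```python
def smooth_by_bin_means(values, num_bins):
    """Replace each value by the mean of the equal-width bin it falls into."""
    if num_bins < 1 or not values:
        raise ValueError("need at least one bin and one value")
    lo, hi = min(values), max(values)
    width = (hi - lo) / num_bins

    def bin_index(v):
        if width == 0:
            return 0  # all values identical: a single bin
        # The maximum value would land in bin num_bins, so clamp it to the last bin.
        return min(int((v - lo) / width), num_bins - 1)

    members = {}
    for v in values:
        members.setdefault(bin_index(v), []).append(v)
    means = {i: sum(m) / len(m) for i, m in members.items()}
    return [means[bin_index(v)] for v in values]


# Illustrative values only: two clear groups, smoothed into two bins.
smoothed = smooth_by_bin_means([1, 2, 3, 10, 11, 12], num_bins=2)
print(smoothed)
```

Using the bin median or the nearest bin boundary as the representative would follow the same structure, with only the line that computes the means changed.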
Normalization of data
In data mining, most of the techniques used involve distance computations. It is therefore important to standardize the variables so that variables with larger values do not dominate the model. Various normalization processes can be undertaken in data mining, such as min-max normalization, which transforms the data values into a specified range. In this project, several types of normalization were applied to the data to give it a structure that could be used within the analytic process. In this dataset, normalization was applied to the employer_num_employees variable, which represents the total number of employees in every company listed in the data.
The first process of normalization in the dataset was to normalize the employer_num_employees using min-max normalization to transform the values into [0.0-1.0] range. The results were as below.
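As an illustration of the min-max step, the sketch below rescales a list of values so that the smallest maps to 0.0 and the largest to 1.0, using (v - min) / (max - min). The function name min_max_normalize is hypothetical. The sample list reuses the minimum (3) and maximum (181,000) reported earlier together with one made-up middle value.

```python
def min_max_normalize(values, new_min=0.0, new_max=1.0):
    """Linearly rescale values into the range [new_min, new_max]."""
    lo, hi = min(values), max(values)
    if lo == hi:
        raise ValueError("min-max normalization is undefined for constant data")
    return [(v - lo) / (hi - lo) * (new_max - new_min) + new_min for v in values]


# 3 and 181000 are the reported extremes; 50000 is an illustrative value.
scaled = min_max_normalize([3, 50000, 181000])
print(scaled)
```

Because the maximum is so much larger than most values, the majority of companies end up close to 0.0 after this transformation, which is worth keeping in mind when reading the normalized column.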
The second normalization process was to use z-score normalization to transform the values as shown below. From the data, foreign workers were found to be concentrated in California, Michigan, and New Jersey. In California, they were mostly from India, China, the Bahamas, and South Korea. Michigan had mostly South Koreans.
The next normalization process was to discretize the employer_num_employees variable into five categories: Startup=0-10; Small_Scale=11-100; Medium_Scale=101-2000; Large_Scale=2001-20000; Giant_Scale=20001+, and to provide the frequency of each category in the data set. The following were the results for each category. As can be seen in every column, the number of employees was set in the ranges of the categories as per the required criteria. The startup group was set and the results were as below. In this category the number of employees was only between 0 and 10. From the report, the companies that had the greatest number of foreign workers were Microsoft, Google, Merrill Lynch, and A2Z Development Inc. Some of those with the fewest workers were A-1 Support Service and Cross Commerce Media.
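The z-score normalization mentioned above subtracts the mean from each value and divides by the standard deviation, so that the result has mean 0 and standard deviation 1. The sketch below uses the population standard deviation. Whether KNIME uses the population or the sample version is not stated in this report, so that choice is an assumption, as are the function name and the sample values.

```python
import statistics


def z_score_normalize(values):
    """Transform values to z-scores: (value - mean) / standard deviation."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)  # assumption: population standard deviation
    if std == 0:
        raise ValueError("z-score is undefined for constant data")
    return [(v - mean) / std for v in values]


# Illustrative values only: mean 5, population standard deviation 2.
z = z_score_normalize([2, 4, 4, 4, 5, 5, 7, 9])
print(z)
```

Unlike min-max normalization, the z-score does not bound the values to a fixed range, so a very large company still appears as a large positive z-score.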
The next category of employees fell in the small scale group with the numbers between 11 and 100 as shown below.
Medium scale employees had values in the range of 101 to 2000.
The large scale category had values ranging from 2001 to 20000, and finally the giant group was all values above 20000.
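The five size categories and their frequencies can be sketched as below, using the ranges given in this section (0-10, 11-100, 101-2000, 2001-20000, and above 20000). The function names and the sample values are hypothetical, and the code is only an illustration of the discretization, not the KNIME workflow itself.

```python
from collections import Counter

# Upper bound (inclusive) of each category, in increasing order.
CATEGORY_BOUNDS = [
    ("Startup", 10),
    ("Small_Scale", 100),
    ("Medium_Scale", 2000),
    ("Large_Scale", 20000),
]


def size_category(num_employees):
    """Map an employee count to its size category."""
    if num_employees < 0:
        raise ValueError("number of employees cannot be negative")
    for name, upper in CATEGORY_BOUNDS:
        if num_employees <= upper:
            return name
    return "Giant_Scale"  # everything above 20000


def category_frequencies(values):
    """Count how many companies fall into each category."""
    return Counter(size_category(v) for v in values)


# Illustrative employee counts only.
frequencies = category_frequencies([3, 7, 50, 500, 181000])
print(frequencies)
```

Keeping the bounds in one ordered list makes it easy to check that the ranges are contiguous and do not overlap.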
Please find attached copies of the following:

  1. Binnedtable
  2. z-score
  3. small scale employees
  4. range [0.0-1.0]
  5. medium scale employees
  6. large scale employees
  7. giant scale employees
  8. startup scale employees