Course
Tutor
University
City
Date:
 
Abstract
Challenges associated with text classification methods have been widely researched in data mining, database, machine learning, and information retrieval settings, with applications in many diverse domains, for instance target marketing, newsgroup filtering, document organization, and medical diagnosis. There are many methods of text classification, including decision trees, SVM classifiers, Bayesian or generative classifiers, neural networks, and pattern- or rule-based classifiers. This report, however, reviews a case study involving the use of rule-based text classification and also outlines a detailed literature review of the same method. To use the input of an expert in the construction of automated coding systems, the expert's knowledge must be formalized in a series of logical steps; a system built this way is referred to as a rule-based architecture. Texts can also be analyzed by computer software to measure the statistical significance of the use of phrases and words per clinical code in a large volume of classified texts; in such a case, one can infer that some combinations indicate a particular code. A system where such inferences are made is identified as the machine learning-based (ML) architecture.
Key words: data, security, code
 
 
Contents
Introduction

  1. Organization and News Filtering
  2. Document Organization and Retrieval
  3. Opinion Mining
  4. Spam Filtering and Email Classifications

Section 1 – Critical Review
Section 2 – Additional References
Full Article Reference
Short Summary of the Articles
Article A
Article B
Article C
Section 3 – Literature Gap
Nurses as Agents of Change
Compatibility
The Attribute of Relative Advantage
Trialability
Observability and Complexity
Conclusion
Reference List
 
 
 

Introduction

Challenges associated with text classification methods have been widely researched in data mining, database, machine learning, and information retrieval settings, with applications in many diverse domains, for instance target marketing, newsgroup filtering, document organization, and medical diagnosis (Shanahan, Qu, and Wiebe, 2006). The classification problem is defined in many ways, but the primary formulation is as follows. Assume there is a set of training records, D = {X1, ..., XN}, such that each record is labeled with a particular class value drawn from a set of K distinct values indexed as {1, ..., K} (Rainer & Prince, 2015). The training data are used to build a classification model that identifies the features underlying the records of each class label. For every test instance, that is, one whose class is unknown, the trained model is used to predict a class label for that instance.
There are usually two versions of the classification problem: the hard and soft versions. In the hard version, a single label is assigned to each test instance, while in the soft version the test instance is assigned a probability value for each class. That is, in the hard version, only a single, definite value is assigned to the instance. Other variants of the classification problem are identified by other attributes; for example, some allow ranking of different class choices for a test instance (Kowalski & Levy, 2006), while others allow a test instance to be assigned multiple labels.
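The distinction between the two versions can be sketched in a few lines of Python. This is an illustrative toy only: the class names, keyword profiles, and the scoring scheme below are invented for the example and are not from the reviewed article.

```python
# Toy sketch of hard vs. soft classification. The classes and keyword
# profiles are invented; a real classifier would learn them from data.
CLASS_KEYWORDS = {
    "sports": {"match", "team", "goal", "score"},
    "finance": {"stock", "market", "price", "bank"},
}

def soft_classify(text):
    """Soft version: return a probability-like value for every class."""
    words = text.lower().split()
    raw = {c: sum(w in kws for w in words) for c, kws in CLASS_KEYWORDS.items()}
    total = sum(raw.values()) or 1  # avoid division by zero
    return {c: v / total for c, v in raw.items()}

def hard_classify(text):
    """Hard version: assign exactly one label, the highest-scoring class."""
    scores = soft_classify(text)
    return max(scores, key=scores.get)
```

For the sentence "the team scored a late goal in the match", the soft version returns a score per class, while the hard version collapses those scores into the single label "sports".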
Certain assumptions are also made in defining the classification problem. For example, categorical (discrete) values are usually assumed for the labels, though it has also been shown that continuous values can be used as labels (Liu, Gegov & Cocea, 2015). The use of continuous values as labels is commonly identified as the regression modeling problem. The problem of classifying texts can be likened to the classification of records with set-valued features (Liu, Gegov & Cocea, 2015). The only recognizable difference is that in the classification of such records, only information regarding the absence or presence of words in a document is used, whereas in real circumstances the frequency of words also has a significant impact on the classification process (Liu, Gegov & Cocea, 2015). Equally important, the typical domain size of text data is considerably greater than in the typical set-valued classification problem (Liu, Gegov & Cocea, 2015).
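The difference between presence/absence features and frequency features described above can be made concrete with a small sketch. The vocabulary and document below are invented for illustration.

```python
# Sketch of the two feature representations discussed above:
# binary presence/absence versus term frequency.
def binary_features(text, vocabulary):
    """1 if the word occurs at all in the document, 0 otherwise."""
    words = set(text.lower().split())
    return {w: int(w in words) for w in vocabulary}

def frequency_features(text, vocabulary):
    """How many times each vocabulary word occurs in the document."""
    words = text.lower().split()
    return {w: words.count(w) for w in vocabulary}

vocab = ["code", "patient", "trauma"]
doc = "patient code code trauma patient code"
```

The binary representation loses the fact that "code" occurs three times, which is exactly the extra signal that frequency-based text classification exploits.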
The problem of text classification is used in a variety of applications in text mining. The primary domains where text classification is commonly put to use include:

1.    Organization and News Filtering

Today, most news services are electronic in nature, a migration attributed to the massive volumes of news articles produced on a daily basis by news organizations and institutions (Weiss, Indurkhya, and Zhang, 2010). In such circumstances, it is nearly impossible for people to organize news articles manually. This situation called for the invention of automated methods, which have proved very handy in categorizing news in many web portals. The application is also identified as text filtering.

2.    Document Organization and Retrieval

Text classification is also used to organize documents in many domains. Some of these domains include web collections, large digital libraries, social feeds, and scientific literature (Weiss, Indurkhya, Zhang, and Damerau, 2010). Document collections that are organized in a hierarchical manner are useful in retrieval and browsing (Weiss, Indurkhya, Zhang, and Damerau, 2010).

3.    Opinion Mining

When customers give opinions and reviews about a product, they usually do so in short text documents (Weiss, Indurkhya, Zhang, and Damerau, 2010). Opinion mining, in this case, refers to the manner in which such documents can be analyzed to identify the most useful reviews.

4.    Spam Filtering and Email Classifications

Email classification is used to categorize emails automatically, either to identify junk email or to determine the subject (Weiss, Indurkhya, Zhang, and Damerau, 2010).
There are many methods of text classification, including decision trees, SVM classifiers, Bayesian or generative classifiers, neural networks, and pattern- or rule-based classifiers (Aggarwal & Zhai, 2012). This report, however, reviews a case study involving the use of rule-based text classification and also outlines a detailed literature review of the same method.

Section 1 – Critical Review

In the article, the research area on which the author focuses is the challenges faced by clinical coding systems. The author hypothesizes that the rule-based extraction technique is used in most industries because it is an easier technique to understand. However, with the increased research into the use of machine learning methods for information extraction, the author also explores the effectiveness of a hybrid technique. Equally important, the author addresses the use of machine learning-based audit coding, a system developed for use by the neurosurgical department of a hospital that deals mostly with cases of trauma.
In the article, the author indicates that the extraction of data from clinical notes has been the center of attention for most researchers in the past few years. The focus is attributed to the fact that it has become imperative for all information in electronic health records to be structured in such a way that it can be shared. Current computer-based systems have been structured so that the data they need is incorporated in the same machines. Physicians have thus continued to record clinical notes in their natural language, which has required them to express themselves efficiently through the spoken or written word. In such a case, it has become challenging and strenuous for individuals to derive coded data from the notes made by the physicians in a different and separate step. Even though the author indicates that the manual way has been identified as the most accurate and reliable method, it is also time demanding, requires a lot of labor input, depends on the availability of competent personnel, and is repetitive and expensive. Equally important, the availability of competent personnel does not guarantee accuracy, as they can easily miss some data through lapses in attention.
The challenges associated with the manual technique, as indicated in the last paragraph, have necessitated the need to come up with other, more relevant approaches. One approach is to build a computer system that enforces the entry of coded data about clinical encounters in an unobtrusive manner. The second approach deals with extracting coded data after data has been entered through free-text notes. In either approach, one logic suggests that codes can be derived by using a computer system that infers the right codes through observing patterns in data that were correctly coded. The second logic suggests that derivation of codes can be achieved through consultation with a system expert.
In the article, the author also indicates that to use the input of an expert in the construction of automated coding systems, the expert's knowledge must be formalized in a series of logical steps; a system built this way is referred to as a rule-based architecture. The author also indicates that texts can be analyzed by computer software to measure the statistical significance of the use of phrases and words per clinical code in a large volume of classified texts. In such a case, one can infer that some combinations indicate a particular code. A system where such inferences are made is identified as the machine learning-based (ML) architecture. The uniqueness of this architecture is that future predictions from new texts are made on the basis of the patterns with which the software has become acquainted.
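A rule-based architecture of the kind described above can be sketched as an ordered list of if-then rules with a default fallback. The rules, patterns, and labels below are invented for illustration and are not the article's actual rule set.

```python
# Minimal sketch of a rule-based clinical-text classifier: formalized
# expert knowledge as ordered if-then rules. Rules and labels are invented.
import re

RULES = [
    (re.compile(r"\b(fracture|haematoma)\b", re.I), "trauma"),
    (re.compile(r"\btumou?r\b", re.I), "oncology"),
]
DEFAULT_LABEL = "unclassified"

def rule_classify(note):
    """Fire the first matching rule; otherwise return the default label."""
    for pattern, label in RULES:
        if pattern.search(note):
            return label
    return DEFAULT_LABEL
```

Because each rule is an explicit expert statement, such a system is easy to inspect and refine, which is the advantage the article attributes to the rule-based approach; the cost is that every rule must be written and maintained by hand.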
According to the author, both machine learning-based and rule-based systems have been embraced in automating code extraction from clinical texts. The author indicates that the most accurate system comes from a hybrid of the two. The author also outlines the advantages and disadvantages of the rule-based technique. Its merits are that the method does not need a reference standard and that it is easy to comprehend. Equally important, it is also open to refinement, which means that one can make it more accurate. On the other hand, the disadvantages are that such systems take a lot of time to create, as considerable consultation and programming effort is involved, and that they require a lot of effort to maintain. Comparing the two systems, machine learning-based systems are more automated and easier to maintain and create, at the expense of being harder to understand and of requiring a reference standard. Also, unlike the rule-based system, the machine learning-based system cannot easily be tuned to acquire more accurate results.
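One simple way to combine the two architectures is to trust a rule when one fires and fall back to a learned model otherwise. This is a hedged sketch of the general hybrid idea only, not the article's actual integration method; the rule, the "learned" keyword scores, and the labels are all invented stand-ins for a real ML classifier.

```python
# Hedged sketch of a rule-first hybrid: a hand-written rule takes priority,
# and an (invented, toy) learned model handles everything the rules miss.
def rule_predict(note):
    """Toy expert rule: returns a label when it fires, else None."""
    if "fracture" in note.lower():
        return "trauma"
    return None

# Stand-in for a trained ML model's learned associations (invented).
LEARNED_ASSOCIATIONS = {"headache": "neurology", "tumour": "oncology"}

def ml_predict(note):
    """Toy ML fallback: pick the first learned association that matches."""
    for word, label in LEARNED_ASSOCIATIONS.items():
        if word in note.lower():
            return label
    return "unclassified"

def hybrid_predict(note):
    """Rule-based prediction first; machine-learning fallback otherwise."""
    return rule_predict(note) or ml_predict(note)
```

The design choice here mirrors the trade-off the author describes: the rules stay auditable and tunable, while the learned component covers cases no expert anticipated.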
The author focuses on the neurosurgical department in a major trauma hospital setting. The author indicates that in the hospital where the research was conducted, neurosurgical data was kept in an internally crafted admissions record system. The system used heavily abbreviated jargon, and auditing codes were derived annually. The author also indicates that the coding systems and data were highly specific to the department, but there was no previous research on the way forward in automating the process of deriving codes. The lack of such information is the gap the research focuses on addressing.
In the article, the author concludes that the adoption of the hybrid system was optimal. The paper also goes into detail on the aspects of the research, in this case the method that can be used in enhancing the primary rule-based predictions. Equally important, the author details how the rule-based system and the machine learning-based system were integrated through the use of the weight vectors of a support vector machine. The manner in which the ML predictions were refined and filtered through a technique identified as post-processing, to obtain more accurate results, has also been described in detail.
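The post-processing step mentioned above can be illustrated, in a very reduced form, as a confidence filter over raw predictions. The prediction format, the threshold value, and the example codes below are assumptions for the sketch, not values from the paper.

```python
# Hedged sketch of post-processing: keep only ML predictions whose
# confidence clears a threshold. Codes, scores, and the 0.6 cutoff
# are invented for illustration.
def post_process(predictions, threshold=0.6):
    """From (code, confidence) pairs, keep codes at or above the threshold."""
    return [code for code, confidence in predictions if confidence >= threshold]

# Example raw output from a hypothetical ML classifier.
raw_predictions = [("A1", 0.91), ("B2", 0.42), ("C3", 0.77)]
accepted = post_process(raw_predictions)
```

Filtering like this trades recall for precision: low-confidence codes are dropped rather than risked as false positives, which is one plausible reading of why the refined predictions were more accurate.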

Section 2 – Additional References

Full Article Reference

A.    Aronson, A. R., Bodenreider, O., Demner-Fushman, D., Fung, K. W., Lee, V. K., Mork, J. G., Névéol, A., Peters, L., and Rogers, W. J. 2007. “From Indexing the Biomedical Literature to Coding Clinical Text: Experience with MTI and Machine Learning Approaches,” in Proceedings of the Workshop on BioNLP 2007: Biological, Translational, and Clinical Language Processing (BioNLP ’07), Stroudsburg, PA, USA: Association for Computational Linguistics, pp. 105–112 (available at http://dl.acm.org/citation.cfm?id=1572392.1572412).
B.     Khademi, S., Haghighi, P. D., Lewis, P., Burstein, F., and Palmer, C. 2015. “Intelligent audit code generation from free text in the context of neurosurgery,” (available at https://www.researchgate.net/profile/Frada_Burstein/publication/286590834_Intelligent_au
C.     Häyrinen, K., Saranto, K., and Nykänen, P. 2008. “Definition, structure, content, use and impacts of electronic health records: A review of the research literature,” International Journal of Medical Informatics (77:5), pp. 291–304 (doi: 10.1016/j.ijmedinf.2007.09.001).

Short Summary of the Articles

Article A.

The article addresses the application of a combination of classification and indexing systems, which has been proved successful in the retrieval of information and in the classification of medical literature. The article also looks into the task of assigning ICD-9-CM codes to the clinical history. The same codes are also assigned to the impression sections of radiology reports. Some of the methods are described in detail in the article, including SVM, a simple pattern matching method, the NLM Medical Text Indexer system, and k-NN. The basic methods are brought together through the use of a variant of stacking.
Article B.
The article seeks to address the tasks associated with the creation of structured data from information that is expressed naturally. The article also identifies the need for codified data in clinical auditing, in that coded data can be used for the analysis of patterns as well as for aggregation. The article also addresses the fact that it is difficult to obtain structured data in the medical domain. The challenge is attributed to the fact that in medical centers, medical encounters are best expressed and recorded through natural language. The article also defines data extraction, which is the derivation of structured data from information that has been expressed naturally. The use of data structures is very common in specialized areas of medicine. However, the process of translating the data has many challenges, and for this reason, research has been conducted to overcome the barriers.
Besides the benefits associated with the collection of structured data, the article also indicates that most health care professionals value the primary method through which they express themselves because it enhances workflow efficiency. It also indicates that systems used in the acquisition of structured data have inflexible and unnatural user interfaces that are identified as a burden to a busy clinician. The article indicates that researchers have proposed that the only method through which such a challenge can be overcome is by leveraging computing technologies to extract codified data from free-form clinical notes. The article also indicates that this can only be achieved through the use of post-hoc text processing.

Article C.
The article addresses the concept of the use of electronic health records. It indicates that there exists a wide range of information systems, from longitudinal data collections of patients to files compiled by a single department. The article also indicates that the use of electronic health records is common in primary, secondary, and tertiary care. Most medical staff, from the nurses to the physicians, had the authority to record data in the electronic health records. The secretarial staff could also enter data into the electronic health records, on condition that the nurses instructed them to do so, or if they obtained the data from the physicians' manual notes. At other times, the patients could also enter data into these records, that is, data validated by the physicians. The article also recommends that in the future development of information systems, the requirements and needs of the different users should be taken into account. Different data components are fed into the electronic health records. These include physical assessments, administration of medication, nursing notes upon admission, charts obtained on a daily basis, nursing care plans, complaints such as extended symptoms, past lifestyles of the patients, physical examination results, diagnostic tests, immunization, and other findings.

 
Short Accounts of how the Articles are Relevant
Article A.
The article will be useful because it provides a detailed outline of the application of a combination of classification and indexing, which has been proved beneficial in data retrieval and in the classification of medical literature. The article will also be used because it describes to the reader the methods used in constructing the ensemble.

Article B.

The article will be relevant to the entire study because, first, it defines the concept of data extraction. The article will also provide information on the challenges associated with the interpretation of structured data. Additionally, the information and perceptions of clinicians towards technological changes in their working environments are also analyzed in the article.

Article C.

The relevance of the article cannot be underestimated. First, the article explains how electronic health records are defined. The article will explain the whole concept of the digital context in hospital settings. Also, the article will explain the contexts in which electronic health records are used. Another relevant point is that the article identifies the individuals who have access to the electronic health records, the data components used, the purpose of conducting research in this field, and the structure of documents fed into the electronic health records.
 
 

 

Section 3 – Literature Gap

The article reviewed fails to address some important aspects. In the article, it is clear that the use of data structures and electronic records is a new concept that requires the staff to be trained so that they acquire the necessary skills. In particular, article C (Häyrinen, Saranto, and Nykänen, 2008) indicates that the next information technology advancements in health care settings should take into account the different staff involved. The article should have addressed the role the nurses should have played as agents of change, as discussed by Rogers in the five attributes. That way, the nurses would be comfortable in accepting the changes.

Nurses as Agents of Change

In health care centers, nurses are the people who should be identified as the means through which new technologies are adopted. There are numerous advantages associated with the use of technology. For example, integrating electronic health records in these facilities and incorporating the internet in these places translates to improved health care among the patients. Also, through the ability to share medical data, clinicians will be able to consult with their counterparts at any time of the day or night. However, it is not guaranteed that the nurses are ready to accommodate changes. In the main article, it is indicated that some clinicians tend to perceive the adoption of data structures as time-consuming and rigid. This, therefore, calls for the five attributes proposed by Rogers to explain the advantage, as well as the process through which nurses can be convinced to adopt the new technological systems and in turn become agents of change.

Compatibility 

The attribute of compatibility addresses the issue of whether the new technological technique will be compatible with the old system. If the systems indicate some level of compatibility, the chances are that there will be no uncertainties among the nurses who are supposed to facilitate the implementation of the new system. According to Holden (2011), new systems of technology must also align with the social and cultural values and beliefs of the nursing community. It has also been noted that electronic health records are being promoted at higher levels to transform the entire health care industry. That is, they are not being adopted just to replace the old manual health records and forms of expression, but are being taken as a means to achieve improved health care practices where each patient is attended to individually (Holden, 2011).
With the implementation of technology in hospitals, for instance electronic health records, nurses fear adoption because they have come across reports of implementation failures. Such cases have instilled fear in the clinicians, and they have ended up developing adverse attitudes towards the adoption of technology (Holden, 2011). Clinicians are therefore advised to embrace the implementation of electronic health records based on the merits associated with such technology rather than on rumors (Holden, 2011).

The Attribute of Relative Advantage

The attribute of relative advantage seeks to address the advantage the new system has over the old system. According to Bramble et al. (2010), the attribute indicates that the new system has to have a better economic impact; that is, the use of electronic health record technology has to be much better than the old system. Nurses are therefore called upon not to prejudge the adoption of the technological systems but to attempt to assess a new change based on its trend, reliability, and merits. Equally important, the issue on the ground is that the merits of adopting new technology and data structures must outweigh the risks associated with them (Bramble et al., 2010).
In article C, for example, the studies showed that nurses are skeptical of embracing the data extraction methods because they perceive the technology to be inflexible and fear that it would hinder them from attending to patients in a flexible manner. They also found it time consuming, in that they would not be capable of multi-tasking. In the same article, the secretarial staff were also skeptical because they could only enter data into these technological systems when dictated to by the nurses or the physicians. Such cases have made the entire healthcare staff skeptical of the new changes. Any healthcare organization should seek to make changes and convince the staff that employing such technological changes is far better (Bramble et al., 2010).

Trialability

This attribute addresses the manner in which new technological systems should be introduced to a healthcare center. According to Rogers (2010), the new system should be introduced in phases. That is, the old systems should not be taken away; instead, a few processes should first be carried out using the new system. That way, the nurses would be able to compare the ease, comfort, and benefits associated with the new systems relative to the old techniques (Blumenthal & Tavenner, 2010).

Observability and Complexity

The attribute of observability addresses the extent to which the nurses can experience the benefits associated with new systems. Nurses and other staff are more likely to adopt and embrace a new technological system if they can see the merits associated with it (Martinez-Garcia & Pulido, 2010). The attribute of complexity, on the other hand, addresses the challenges and levels of difficulty in learning how to operate the new systems (Martinez-Garcia & Pulido, 2010). Health care institutions can adopt technological systems in two ways while putting into consideration the reactions of the clinicians. The methods are the phased approach and the big bang approach (Martinez-Garcia & Pulido, 2010). The big bang approach is where all the staff, that is, the nurses, secretarial staff, and physicians, are required to attend training for a short time on a single day (Martinez-Garcia & Pulido, 2010). The phased approach, on the other hand, requires that the departments attend training sessions in an incremental manner, and it could take months before the training is completed (Martinez-Garcia & Pulido, 2010).

Conclusion

In conclusion, the paper has covered the ways in which text classification problems are classified, namely the hard and soft versions. In the hard version, a single label is assigned to each instance, while in the soft version the test instance is assigned a probability value; that is, in the hard version, only a single, definite value is assigned to the instance. To use the input of an expert in the construction of automated coding systems, the expert's knowledge must be formalized in a series of logical steps; a system built this way is referred to as a rule-based architecture. Texts can also be analyzed by computer software to measure the statistical significance of the use of phrases and words per clinical code in a large volume of classified texts; in such a case, one can infer that some combinations indicate a particular code. A system where such inferences are made is identified as the machine learning-based (ML) architecture.
 
 

Reference List

Aggarwal, C. C., & Zhai, C. (2012). Mining text data. New York, Springer. http://www.books24x7.com/marc.asp?bookid=54151.
Aronson, A. R., Bodenreider, O., Demner-Fushman, D., Fung, K. W., Lee, V. K., Mork, J. G., Névéol, A., Peters, L., and Rogers, W. J. 2007. “From Indexing the Biomedical Literature to Coding Clinical Text: Experience with MTI and Machine Learning Approaches,” in Proceedings of the Workshop on BioNLP 2007: Biological, Translational, and Clinical Language Processing (BioNLP ’07), Stroudsburg, PA, USA: Association for Computational Linguistics, pp. 105–112 (available at http://dl.acm.org/citation.cfm?id=1572392.1572412).
Blumenthal, D., & Tavenner, M. (2010). The “meaningful use” regulation for electronic health records. New England Journal of Medicine, 363(6), 501-504.
Bramble, J. D., Galt, K. A., Siracuse, M. V., Abbott, A. A., Drincic, A., Paschal, K. A., & Fuji, K. T. (2010). The relationship between physician practice characteristics and physician adoption of electronic health records. Health Care Management Review, 35(1), 55-64.
Castillo, V. H., Martínez-García, A. I., & Pulido, J. R. G. (2010). A knowledge-based taxonomy of critical factors for adopting electronic health record systems by physicians: a systematic literature review. BMC Medical Informatics and Decision-Making, 10(1), 1.
Emani, S., Yamin, C.K., Peters, E., Karson, A.S., Lipsitz, S.R., Wald, J.S., Williams, D.H., and Bates, D.W., 2012. Patient perceptions of a personal health record: a test of the diffusion of innovation model. Journal of Medical Internet Research, 14(6), p.e150.
Häyrinen, K., Saranto, K., and Nykänen, P. 2008. “Definition, structure, content, use and impacts of electronic health records: A review of the research literature,” International Journal of Medical Informatics (77:5), pp. 291–304 (doi: 10.1016/j.ijmedinf.2007.09.001).
Holden, R. J. (2011). What stands in the way of technology-mediated patient safety improvements? A study of facilitators and barriers to physicians’ use of electronic health records. Journal of Patient Safety, 7(4), 193.
Khademi, S., Haghighi, P. D., Lewis, P., Burstein, F., and Palmer, C. 2015. “Intelligent audit code generation from free text in the context of neurosurgery,” (available at https://www.researchgate.net/profile/Frada_Burstein/publication/286590834_Intelligent_au
Kowalski, T. J., & Levy, L. S. (2006). Rule-Based Programming. Boston, MA, Springer US. http://public.eblib.com/choice/publicfullrecord.aspx?p=3079472.
Liu, H., Gegov, A., & Cocea, M. (2015). Rule based systems for big data: a machine learning approach. http://search.ebscohost.com/login.aspx?direct=true&scope=site&db=nlebk&db=nlabk&AN=1062737.
Rainer, R. K., & Prince, B. (2015). Introduction to information systems: supporting and transforming business.
Rogers, E. M. (2010). Diffusion of innovations. Simon and Schuster.
Shanahan, J.G., Qu, Y. and Wiebe, J. eds., 2006. Computing attitude and affect in text: theory and applications (Vol. 20). Dordrecht, the Netherlands: Springer.
Weiss, S.M., Indurkhya, N. and Zhang, T., 2010. Fundamentals of predictive text mining (Vol. 41). London: Springer.
Weiss, S.M., Indurkhya, N., Zhang, T. and Damerau, F., 2010. Text mining: predictive methods for analyzing unstructured information. Springer Science & Business Media.