Data Mining Helps in Cancer Research
According to the American Cancer Society, half of all men and one-third of all women in the
United States will develop cancer during their lifetimes; approximately 1.5 million new cancer
cases were expected to be diagnosed in 2013. Cancer is the second-most-common cause of death
in the United States and in the world, exceeded only by cardiovascular disease. This year, over
500,000 Americans are expected to die of cancer—more than 1,300 people a day—accounting
for nearly one of every four deaths. Cancer is a group of diseases generally characterized by
uncontrolled growth and spread of abnormal cells. If that growth or spread is not
controlled, the disease can result in death. Although the exact causes are not known, cancer is believed
to be caused by both external factors (e.g., tobacco, infectious organisms, chemicals, and
radiation) and internal factors (e.g., inherited mutations, hormones, immune conditions, and
mutations that occur from metabolism). These causal factors may act together or in sequence to
initiate or promote carcinogenesis. Cancer is treated with surgery, radiation, chemotherapy,
hormone therapy, biological therapy, and targeted therapy. Survival statistics vary greatly by
cancer type and stage at diagnosis. The 5-year relative survival rate for all cancers is improving,
and the decline in cancer mortality had reached 20% by 2013, translating into the avoidance of about
1.2 million deaths from cancer since 1991. That’s more than 400 lives saved per day! The
improvement in survival reflects progress in diagnosing certain cancers at an earlier stage and
improvements in treatment. Further improvements are needed to prevent and treat cancer.

Even though cancer research has traditionally been clinical and biological in nature, in recent years
data-driven analytic studies have become a common complement. In medical domains where
data- and analytics-driven research has been applied successfully, novel research directions
have been identified to further advance clinical and biological studies. Using various types of
data, including molecular, clinical, literature-based, and clinical trial data, along with suitable data
mining tools and techniques, researchers have been able to identify novel patterns, paving the
road toward a cancer-free society.

In one study, Delen (2009) used three popular data mining
techniques (decision trees, artificial neural networks, and SVMs) in conjunction with logistic
regression to develop prediction models for prostate cancer survivability. The data set contained
around 120,000 records and 77 variables. A k-fold cross-validation methodology was used in