The 2015 SCSUG Educational Forum will be hosting the 3rd annual Student Symposium in Baton Rouge, LA.
Analytics is becoming a major tool for generating maximum value from data and supporting business decisions. Educating and training students in both methodology and software application is critical to filling the demand for such analytical expertise. This technical training must be accompanied by opportunities to strengthen soft skills as well.
Louisiana State University, Oklahoma State University and The University of Alabama are jointly hosting a Graduate Student Symposium in Analytics as a venue for communicating analytical ideas, sharing SAS software techniques, and learning from analytics professionals. The symposium will consist of nine student presenters (three from each university) and will cover a wide variety of analytics topics, including ways in which SAS software is used to analyze data. An industry expert from a sponsoring company will be assigned to each student presenter and will serve as a mentor, providing post-presentation recommendations and commentary on how the student’s work contributes to the body of knowledge on methodology and software application.
The Student Symposium will be held on Friday, October 30, 2015, from 10:10 a.m. to 4:00 p.m.
See below for a listing of the scheduled abstracts. For more information, please contact Joni Shreve at jnunner@lsu.edu.
Study of Cancer Research by Small Businesses: A Text Analytics Approach Using SAS Enterprise Miner --- Sairam Tanguturi and Nithish Reddy Yeduguri --- Oklahoma State University
Taken as a whole, about half of the people receiving treatment for invasive cancer die from the disease or during its treatment, a fact that underscores the importance of the search for a cure across the diverse field of medical research. Moreover, the more than two hundred variants of cancer demand different approaches to treatment: treatments that work for some cancers do not work for others, and sometimes treatments simply stop working, making cancer one of the most complex mysteries of modern society and one that needs immediate answers. Cancer research has long been the forte of pharmaceutical giants, but of late small businesses and startups have also taken an interest. Through this project we explore the specific areas these small businesses are targeting within a research field of such enormous scope, hoping to cast light on developments in the rapidly growing small business arena of cancer research. Using the proposals submitted to the Small Business Innovation Research (SBIR) program as our data, we apply a text analytics approach in SAS Enterprise Miner to determine exactly where current small business players are focusing their cancer research efforts and which areas need more attention.
SAS & R: A Perfect Combination for Sports Analytics --- Matthew B. Collins and Taylor K. Larkin --- The University of Alabama
Revolution Analytics reports more than two million R users worldwide. SAS has the capability to run R code, but users have discovered a slight learning curve when performing certain basic functions, such as getting data from the web. R is a functional programming language, while SAS is a procedural programming language; these differences create difficulties when first making the switch from programming in R to programming in SAS. However, SAS/IML bridges the two languages by letting users write R code directly inside SAS/IML. This paper details the process of using the SAS/IML SUBMIT / R statement and the R package “XML” to get data from the web into SAS/IML. The project uses public basketball data for each of the 30 NBA teams over the past 33 years, taken directly from Basketball-Reference.com. The data were retrieved from 66 individual web pages, cleaned using R functions, and compiled into a final dataset composed of 48 variables and 895 records. The seamless compatibility between SAS and R provides an opportunity to use R code in SAS for robust modeling. The resulting analysis provides a clear and concise approach for those interested in pursuing sports analytics, as well as a comparison of performance between SAS and R. [Presented as ePoster at SAS Analytics Conference, 2015].
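As a rough illustration of the technique this abstract describes, the sketch below calls R from PROC IML to scrape one table and pulls the resulting data frame back into a SAS data set. The URL and table index are hypothetical stand-ins for one of the 66 pages, and the SAS session must be started with the RLANG system option and have a local R installation for SUBMIT / R to work.

proc iml;
   submit / R;
      library(XML)
      # hypothetical example page; the paper scrapes 66 such pages
      url <- "http://www.basketball-reference.com/teams/BOS/"
      tbl <- readHTMLTable(url, which = 1, stringsAsFactors = FALSE)
   endsubmit;
   /* copy the R data frame tbl into the SAS data set WORK.TEAM_STATS */
   call ImportDataSetFromR("work.team_stats", "tbl");
quit;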
Game of Thrones: Text Analysis of George R.R. Martin’s Book Series A Song of Ice and Fire Using SAS Enterprise Miner --- Brad Gross and Srividhya Naraharirao --- Louisiana State University
There is a growing need to derive meaningful information from text, and SAS Enterprise Miner provides numerous tools that aid in reading in text and turning it into information useful for analysis. This presentation details a text analysis of the famous book series A Song of Ice and Fire by George R.R. Martin, which has spawned the hit HBO series Game of Thrones. The books have a unique narrative structure that switches narrator point of view on a per-chapter basis, along with dozens of characters and locations. The analysis attempts to gain interesting insights from the chapters using SAS Text Miner. The SAS dataset considered for analysis contains all five books in the series broken into individual chapters, with additional metadata including narrator name and family. Using SAS Enterprise Miner, we performed multiple analyses that include determining narrator traits based on common text clusters and factor analysis, determining character qualities based on common words used, and determining relationship strength based on interactions. The analysis leverages a combination of SAS Text Miner nodes (Text Import, Text Parsing, Text Filter, Text Profile, Text Cluster, and Text Topic for pattern discovery) and SAS Enterprise Miner nodes (Filter, Data Partition, Metadata, Regression, and Save Data) for data processing and predictive modeling. The results show the ease with which Enterprise Miner translates unstructured text into useful information and uncovers fascinating patterns that would otherwise be extremely difficult to discern.
Debt Collection Through SAS® Analytics Lens --- Karush Jaggi --- Oklahoma State University
Debt collection! The two words can trigger multiple images in one’s mind, mostly harsh. However, let’s think positive for a moment. In 2013, over $55 billion was past due in the United States. What if all of it were left as is, and the fate of credit issuers rested on goodwill payments made by defaulters? Not the most sustainable model, to say the least. Debt collection thus serves as a tool employed at multiple levels of recovery to keep credit flowing. Ranging from in-house to third-party to individual collection efforts, the industry is huge. In the recent past, with financial markets recovering and banks selling fewer charged-off accounts at higher prices, collections has increasingly become a game of efficient operations backed by solid analytics. This paper takes you into the back alleys of all the data that is available and gives an overview of some ways modeling can be used to shape collection strategy. SAS® tools like Enterprise Miner™ and Enterprise Guide™ are utilized extensively for both data manipulation and modeling, with particular focus on decision trees to understand which factors make the most impact. Along the way, the paper also describes how analytics teams today work to get buy-in from other stakeholders in a company, which, surprisingly, is one of the most challenging aspects of the job.
Telecommunications, Power Grids, and Cataclysmic Damage --- Taylor K. Larkin --- The University of Alabama
Coronal Mass Ejections (CMEs) are massive explosions of magnetic field and plasma from the Sun. While responsible for the illustrious northern lights, these eruptions have the potential to cause geomagnetic storms and wreak cataclysmic damage on Earth’s telecommunications systems and power grid infrastructures, costing millions of dollars. Hence, it is imperative to construct highly accurate predictive processes to determine whether an incoming CME will produce devastating effects on Earth. One such process, called “stacked generalization,” trains a variety of models, or base-learners, on a dataset. Then, utilizing the predictions from the base-learners, another model is trained on this metadata. The goal of this meta-learner is to deduce information about the biases of the base-learners in order to make more accurate predictions. Studies have shown success in using linear methods, especially within regularization frameworks, at the meta-level to combine the base-level predictions. In this work, SAS Enterprise Miner 13.1 is used to reinforce the advantages of regularization on this type of metadata by comparing LASSO to OLS solutions when predicting the occurrence of strong geomagnetic storms caused by CMEs. [Presented as ePoster at SAS Analytics Conference, 2015].
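The abstract’s LASSO-versus-OLS comparison is run in SAS Enterprise Miner 13.1; outside the Miner interface, a minimal sketch of the same meta-level idea might look like the following, where META_TRAIN, the STORM indicator, and the base-learner prediction columns (PRED_TREE, PRED_NNET, PRED_REG) are hypothetical names.

/* OLS meta-learner on the base-learner predictions */
proc reg data=meta_train;
   model storm = pred_tree pred_nnet pred_reg;
run; quit;

/* LASSO meta-learner, with the penalty chosen by cross validation */
proc glmselect data=meta_train;
   model storm = pred_tree pred_nnet pred_reg
         / selection=lasso(choose=cv stop=none) cvmethod=random(5);
run;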
Contextualized Market Basket Analysis – How to Learn More from Your Point of Sale Data in Base SAS and SAS Enterprise Miner --- Andrew Kramer --- Louisiana State University
With the advent and growth of Point of Sale systems to gather, organize, and store transactional data, researchers began developing methods to discover patterns and useful insight from this new data source. Perhaps most influential was the Apriori algorithm, originally proposed in the early 1990s with the goal of saving computational power by not examining all possible subsets of Stock Keeping Units (SKUs), but rather the relationships between SKUs that occur commonly in the dataset. While these advances in unsupervised learning have led to sophisticated predictive models such as recommender systems, current market basket analyses fail to address the fundamental question presented by any manager: do the associations create profitable, long-term relationships with customers? The biggest issue retailers face with market basket analyses is that common algorithms ignore several key parameters, such as Customer ID, Date, Quantity Purchased, and Price, that many retailers live and die by. While it is not necessarily wrong to exclude these parameters from the analysis, the results are difficult to interpret without this additional context. This paper presents a contextualized way to analyze the results of a market basket analysis by looking at market baskets as a whole, determining how the sale of certain SKUs affects the mix of profitable and unprofitable customers, as well as the overall size and makeup of the baskets. Macro variables, PROC SQL, and DATA steps will be discussed in detail, with a focus on implementation both inside and outside of SAS Enterprise Miner.
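As a rough sketch of the kind of basket-level context the paper describes, the PROC SQL steps below roll line items up to whole baskets and then compare baskets that do and do not contain a focal SKU. The TRANSACTIONS table and its columns (ORDER_ID, SKU, QTY, PRICE, COST) are hypothetical.

%let focal_sku = 12345;  /* hypothetical SKU of interest */

proc sql;
   /* roll line items up to whole baskets */
   create table basket_summary as
   select order_id,
          count(distinct sku)       as n_skus,
          sum(qty * price)          as revenue,
          sum(qty * (price - cost)) as profit
   from transactions
   group by order_id;

   /* compare baskets with and without the focal SKU */
   create table sku_impact as
   select (order_id in (select order_id from transactions
                        where sku = &focal_sku)) as has_focal,
          avg(n_skus)  as avg_basket_size,
          avg(revenue) as avg_revenue,
          avg(profit)  as avg_profit
   from basket_summary
   group by calculated has_focal;
quit;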
Re-admission Analysis of COPD Patients --- Saurabh Nandy --- Oklahoma State University
Readmissions are a major driver of increased medical expenses. Chronic Obstructive Pulmonary Disorder (COPD) is one of the costliest diseases, and some of its medications are among the most expensive. Readmissions are problematic for patients as well as hospitals, resulting in a loss of taxpayer money on medications and hospitalization bills.
In this study, utilizing electronic medical records (EMR) from Cerner Corporation’s data warehouse, we examined readmission statistics of Medicare patients suffering from different kinds of Chronic Obstructive Pulmonary Disorder (COPD), such as simple, mucopurulent, and others, using 2.5 million observations. The main purpose of this study was to predict whether a patient will be readmitted, using medication information and demographic information such as age, gender, and marital status. We built and compared predictive models including regression, decision tree, and neural network models. Based on average squared error, the decision tree model performed best at predicting readmission, with an average squared error of 0.03651. Our model can be used by hospitals and other healthcare services to predict readmissions in advance, which can help in the proper allocation of resources.
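The abstract’s models are built in Enterprise Miner; as a rough Base SAS analogue of the winning model, a decision tree with a holdout partition and a binary readmission target might be sketched as below. The data set and variable names (COPD_EMR, READMIT, and the predictors) are hypothetical.

proc hpsplit data=copd_emr seed=20151030;
   class readmit gender marital_status drug_class;
   model readmit(event="1") = age gender marital_status drug_class;
   /* hold out 30% of observations to estimate average squared error */
   partition fraction(validate=0.3);
run;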
SAS: Detecting Phishing Attempts with Minimally Invasive Email Log Data --- Taylor B. Anderson --- The University of Alabama
Phishing is the attempt of a malicious entity to acquire personal, financial, or otherwise sensitive information like usernames and passwords from recipients through the transmission of seemingly legitimate emails. By quickly alerting recipients of known phishing attacks, an organization can reduce the likelihood that a user will succumb to the request and unknowingly provide sensitive information to attackers. Methods to detect phishing attacks typically require the body of each email to be analyzed. Many institutions simply rely on the education and participation of recipients within their network; recipients are encouraged to alert information security (IS) personnel of potential attacks as they are delivered to their mailboxes. In this work, a novel and more automated approach is explored utilizing SAS to examine email header and transmission data to determine likely phishing attempts that can be further analyzed by IS personnel. One collection of email data is examined containing 2,703 emails as seen from a mail filtering appliance. Another collection of known phishing attack emails is used to score against the model. Finally, real time email traffic is exported from Splunk Enterprise into SAS for analysis. The resulting model is examined for its ability to alert IS personnel to potential phishing attempts faster than a user simply forwarding a suspicious email to IS personnel. [Presented as ePoster at SAS Analytics Conference, 2015].
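The abstract does not name a model type, so purely as an illustration, a sketch of classifying email traffic from header features might look like the following logistic regression. The CSV path, the IS_PHISH label, and the header-derived predictors (SPF_PASS, URL_COUNT, N_RECIPIENTS) are all hypothetical stand-ins for the fields actually exported from Splunk Enterprise.

/* read a hypothetical header-feature export from Splunk */
proc import datafile="/data/email_headers.csv" out=work.emails
     dbms=csv replace;
   getnames=yes;
run;

/* fit a simple phishing classifier on header features only */
proc logistic data=work.emails;
   class spf_pass(param=ref);
   model is_phish(event="1") = spf_pass url_count n_recipients;
   /* score incoming traffic with the fitted model */
   score data=work.emails out=work.scored;
run;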
An Insight into the Significance of Parameter Estimates: A “Target Shuffling” Approach Using Base SAS --- Alfred Koffi-Sokpa, Nidhi Gupta, and Rahul Roshan --- Louisiana State University
In linear regression, statistical significance tests rely on the level of significance set by the researcher to determine whether or not parameter estimates were obtained by chance. We will present our SAS macro designed to perform target shuffling, an empirical method aimed at identifying false positives. Our macro reads a data set, randomly shuffles the target variable only, then conducts a regression analysis and saves the estimated betas. This process is repeated numerous times, and the estimated betas are aggregated using the MIANALYZE procedure. The means of the estimated parameters should ideally be zero, since no relationship should exist when only the target variable is shuffled and the explanatory variables are left unchanged. By the very nature of significance tests, we will still observe some relationships, which are spurious relative to the original dataset. We then plot and compare the distribution of the fabricated betas against the original estimated parameters and judge the overlap so that false positives can be detected.
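A minimal sketch of the shuffling loop the abstract describes, with hypothetical macro parameters; for brevity it collects the betas with PROC APPEND rather than the authors’ MIANALYZE aggregation step.

%macro target_shuffle(data=, target=, xvars=, reps=100);
   %do i = 1 %to &reps;
      /* attach a random sort key to the target column only */
      data _shuf;
         set &data(keep=&target);
         _key = rand("uniform");
      run;
      proc sort data=_shuf; by _key; run;
      /* re-attach the shuffled target to the untouched predictors */
      data _perm;
         merge &data(drop=&target) _shuf(drop=_key);
      run;
      /* regress and capture the estimated betas */
      ods output ParameterEstimates=_pe;
      proc reg data=_perm;
         model &target = &xvars;
      run; quit;
      data _pe; set _pe; rep = &i; run;
      proc append base=work.shuffled_betas data=_pe force; run;
   %end;
%mend target_shuffle;

/* example call on a built-in data set */
%target_shuffle(data=sashelp.class, target=weight, xvars=height age, reps=200);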