The purpose of this project is to follow the process of going from data to knowledge using a data set that applies to a real-world problem. For this project, you will form teams of 4 to 5 students. Your team’s objective is to locate a data set that helps solve a specific problem. This means that when locating a data set, you should be thinking about an impactful problem that working with this data set would help solve and how the data set will allow you to work towards solving it. You may use any data mining software packages or libraries you wish for performing data mining tasks and any programming language for cleaning and pre-processing the data.

Project Report [8-10 pages, 2500-3500 words]

• Executive Summary – give a high-level description of the problem and what will be included in the remainder of the report. Be sure to mention the overall results of the project as part of this summary!

• Problem Description – give a 2-3 sentence description of the problem your team solved.

• Data Exploration

o Source(s) of the data – provide link(s) and any background on the data that might be of interest. For example, some data is “simulated” and did not come from a real-world source.

o Number of records

o Attribute description – begin with a high-level description of each attribute, including what the attribute is and its type (continuous or discrete). Using a table to do this is ideal. Next, for each attribute, provide summary statistics (min, max, mean, standard deviation, frequency, mode). There should be figures/plots for at least some of the attributes, especially ones that appear to be interesting.

o Missing values – did you find any missing values? If so, how do you plan to deal with these?

o Outliers – did you find any outliers? If so, how do you plan to deal with these?
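As a rough illustration of the Data Exploration items above, the following Python sketch computes summary statistics for one numeric attribute, counts missing values, and flags possible outliers. It is only one way to do this: the `ages` column and its values are made up for the example, missing values are assumed to be stored as `None`, and the 1.5 × IQR rule is a common convention rather than a requirement of this project. The same steps can be carried out in R, Weka, or any other tool your team prefers.

```python
import statistics

# Hypothetical attribute values; None marks a missing value (an assumption).
ages = [23, 35, None, 41, 29, 35, 120, 31, None, 27]


def summarize(values):
    """Return summary statistics for a numeric attribute, ignoring missing values."""
    present = [v for v in values if v is not None]
    return {
        "count": len(present),
        "missing": len(values) - len(present),
        "min": min(present),
        "max": max(present),
        "mean": statistics.mean(present),
        "stdev": statistics.stdev(present),  # sample standard deviation
        "mode": statistics.mode(present),
    }


def iqr_outliers(values, k=1.5):
    """Return values outside [Q1 - k*IQR, Q3 + k*IQR], ignoring missing values."""
    present = sorted(v for v in values if v is not None)
    q1, _, q3 = statistics.quantiles(present, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in present if v < low or v > high]


stats = summarize(ages)
outliers = iqr_outliers(ages)
print(stats)
print("possible outliers:", outliers)
```

Whether a flagged value is a data-entry error or a genuine extreme case is a judgment your team should make and explain in the report.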

• Data Preprocessing – which preprocessing steps did you use? Below is some guidance on what to write about if you performed any of the following steps. Only include preprocessing steps that you did as part of your project.

o Discretization – show the discretization scheme you used and the new distribution for any attributes you discretized.

o Sampling – describe the process you used and show the summary statistics of the attributes for the sampled data set.

o Aggregation – describe the process used for aggregating data within an attribute and show the new distribution. Give reasoning for why you did this.

o Dimensionality Reduction/Feature Selection – how and why did you decide to remove features from the data set? What was the result of having done this?
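To make the discretization and sampling items above more concrete, here is a minimal Python sketch. It assumes equal-width binning and simple random sampling without replacement, which are only two of several reasonable choices; the `incomes` values, the number of bins, the sample size, and the seed are all hypothetical.

```python
import random
from collections import Counter


def equal_width_bins(values, n_bins):
    """Assign each value to one of n_bins equal-width intervals (0-based index).

    Assumes the values are not all identical, so the bin width is non-zero.
    """
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    # The maximum value would fall just past the last bin, so clamp it.
    return [min(int((v - lo) / width), n_bins - 1) for v in values]


# Hypothetical attribute values, for illustration only.
incomes = [12, 18, 25, 31, 40, 44, 52, 60, 75, 90]

bins = equal_width_bins(incomes, n_bins=3)
distribution = Counter(bins)  # the new distribution after discretization
print("bin per record:", bins)
print("records per bin:", dict(distribution))

# Simple random sampling without replacement; a fixed seed keeps it reproducible.
rng = random.Random(42)
sample = rng.sample(incomes, k=5)
print("sample:", sample)
```

In the report, the bin boundaries and the per-bin counts are what show the discretization scheme and the new distribution; for sampling, recompute the summary statistics on the sampled records.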

• Data Mining Techniques/Algorithms Used – Describe the techniques and algorithms used at a high level and why you decided to use them.

• Results – Perform an appropriate analysis of results. For example, discuss errors in classification models using confusion matrices. The purpose here is to compare the results obtained from various models or approaches that you tried.

• Conclusions and Lessons Learned – What are the major takeaways from this project in terms of how well you were able to solve the problem you stated? What did you learn from working on the project together as a team?
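For the Results section, a confusion matrix for a two-class problem can be built with a few lines of code. The sketch below is an illustration only: the `actual` and `predicted` labels are invented, "yes" is arbitrarily treated as the positive class, and accuracy, precision, and recall are just some of the measures you might report when comparing models.

```python
from collections import Counter


def confusion_matrix(actual, predicted, labels):
    """Return a nested dict: matrix[actual_label][predicted_label] -> count."""
    counts = Counter(zip(actual, predicted))
    return {a: {p: counts[(a, p)] for p in labels} for a in labels}


# Hypothetical true and predicted class labels, for illustration only.
actual = ["yes", "yes", "no", "no", "yes", "no", "no", "yes"]
predicted = ["yes", "no", "no", "yes", "yes", "no", "no", "yes"]

matrix = confusion_matrix(actual, predicted, labels=["yes", "no"])

# Treat "yes" as the positive class (an assumption for this example).
tp = matrix["yes"]["yes"]  # true positives
fn = matrix["yes"]["no"]   # false negatives
fp = matrix["no"]["yes"]   # false positives
tn = matrix["no"]["no"]    # true negatives

accuracy = (tp + tn) / len(actual)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
print(matrix)
print(accuracy, precision, recall)
```

Computing the same matrix for each model you tried, on the same test records, gives a like-for-like basis for the comparison the Results section asks for.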

Data Mining Tools/Languages/Libraries

• R – https://www.r-project.org

• Weka – https://www.cs.waikato.ac.nz/ml/index.html

• PowerBI – https://powerbi.microsoft.com/en-us/

Examples of Past Projects

IMDB movie earnings prediction – https://www.imdb.com/interfaces/

Credit card customer default prediction – https://archive.ics.uci.edu/ml/datasets/default+of+credit+card+clients

Predicting success of Kickstarter Campaigns – https://www.kaggle.com/kemical/kickstarter-projects

Predicting video game sales – https://www.kaggle.com/rush4ratio/video-game-sales-with-ratings
