CS 461 Final Project
Due Dates
- Proposal: January 24 (midnight, on CSNS)
- Presentation and report: March 14 (presentation in class, report at midnight on CSNS)
This project provides you with a chance to investigate a machine learning problem of personal interest. You'll select or create a data set, select at least two machine learning algorithms, and compare the algorithms' ability to solve the problem.
This project is worth 100 points and counts for 25% of your final grade.
Part 1: Proposal (30 points; due January 24 at midnight)
Before you begin work on the project itself, you'll want to plan out your goals and how you will approach the problem. To that end, you will write a project proposal and submit it for comment and feedback.
I encourage you to discuss your ideas with me ahead of time, in office hours or by email.
The proposal must be submitted in text or PDF format and must contain the following items, in this order:
Standard assignment header: your name, the class name and number (CS 461, Machine Learning), the quarter (Winter 2009), the name of the assignment (Project Proposal), and the assignment due date (January 24, 2009).
Presentation preference: Please indicate whether you prefer an oral presentation plus short written report, or a single longer written report (see below for details). If you want to ensure that you get an oral presentation, you can email me this preference earlier.
Problem description (1 paragraph). Provide enough detail that someone with no experience in the problem area can still see what the learning goal is. The problem could be taken from your daily life, a personal hobby, or be completely fictional. (You may find it easier, however, if it involves a real situation that you have experience with.) It should be a supervised problem that requires classification (predict a class label) or regression (predict a real-valued quantity).
Data set description (1 paragraph). Your data set must include at least 100 observations (instances).
- Indicate the source of the data set, if you are using an existing data set, or note that you will be collecting the data yourself. There are links to standard data sets available on the "Links" page of the course website. If you create your own data set, you must store it in the ARFF format. If you have questions about this format, let me know.
- Include a list of the features you decide to use to represent the data.
- For each feature, specify its type (numeric or symbolic) and range of values.
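If you are building your own data set, the following is a minimal sketch of how an ARFF file could be generated with plain Python. The relation name, feature names, and rows are made up for illustration, and the helper `to_arff` is hypothetical (not part of Weka); it only covers numeric and symbolic (nominal) features whose values contain no spaces or commas.

```python
# Minimal sketch: build a tiny, made-up data set in ARFF format.
# All names and values below are hypothetical placeholders.

def to_arff(relation, attributes, rows):
    """Return an ARFF document as a string.

    attributes: list of (name, kind) pairs, where kind is either the
    string "numeric" or a list of allowed symbolic (nominal) values.
    rows: list of value tuples, one per observation (instance).
    """
    lines = ["@relation " + relation, ""]
    for name, kind in attributes:
        if isinstance(kind, str):
            # Numeric feature, e.g. "@attribute temperature numeric"
            lines.append("@attribute %s %s" % (name, kind))
        else:
            # Symbolic feature, e.g. "@attribute play {yes,no}"
            lines.append("@attribute %s {%s}" % (name, ",".join(kind)))
    lines += ["", "@data"]
    for row in rows:
        lines.append(",".join(str(value) for value in row))
    return "\n".join(lines) + "\n"


attributes = [
    ("temperature", "numeric"),
    ("outlook", ["sunny", "overcast", "rainy"]),
    ("play", ["yes", "no"]),  # class label, listed last by convention
]
rows = [(85, "sunny", "no"), (64, "overcast", "yes"), (70, "rainy", "yes")]

arff_text = to_arff("weather_example", attributes, rows)
print(arff_text)
```

Saving `arff_text` to a file with an `.arff` extension should give you something Weka can open; a real data set would of course need at least 100 rows.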
Your choice of machine learning algorithms to solve the problem. Choose at least two of: k-nearest neighbors, decision trees, support vector machines, neural networks, or naive Bayes.
Also indicate how you will obtain an implementation of these algorithms to run your experiments: use an implementation provided by Weka, implement the algorithm yourself, or obtain one from some other source. (1 paragraph total)
Your experimental plan (1 paragraph).
- What kind of evaluation will you use (e.g., separate training and test data sets, cross-validation, manual examination of results, user study)?
- What quantitative metrics will you use to evaluate performance? For example, you might use classification accuracy or error rate for a classification problem or RMSE (root of the mean squared error) for a regression problem. See Alpaydin Chapter 14.
- Include at least one baseline to which your algorithms' performance can be further compared. This should be a simple approach to the problem, such as predicting the most common class (for classification) or the mean output value (for regression), etc.
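To make the metrics and baselines above concrete, here is a small sketch in plain Python. The function names and the toy labels and targets are hypothetical; if you use Weka, it reports these quantities for you, so treat this only as an illustration of what the numbers mean.

```python
import math
from collections import Counter


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true class labels."""
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)


def rmse(y_true, y_pred):
    """Root of the mean squared error between true and predicted values."""
    squared_error = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    return math.sqrt(squared_error / len(y_true))


def majority_class(train_labels):
    """Classification baseline: the most common class in the training data."""
    return Counter(train_labels).most_common(1)[0][0]


def mean_value(train_targets):
    """Regression baseline: the mean output value in the training data."""
    return sum(train_targets) / len(train_targets)


# Made-up classification example
train_labels = ["yes", "yes", "no", "yes"]
test_labels = ["yes", "no", "yes", "no"]
baseline_label = majority_class(train_labels)
baseline_acc = accuracy(test_labels, [baseline_label] * len(test_labels))

# Made-up regression example
train_targets = [1.0, 2.0, 3.0]
test_targets = [2.0, 4.0]
baseline_value = mean_value(train_targets)
baseline_rmse = rmse(test_targets, [baseline_value] * len(test_targets))

print("baseline accuracy:", baseline_acc)
print("baseline RMSE:", baseline_rmse)
```

Note that both baselines are fit on the training data only and then scored on the test data, the same way your chosen algorithms should be evaluated.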
Submit your project proposal in a file named <yourlastname>-proposal.txt (or <yourlastname>-proposal.pdf) to CSNS by midnight on January 24.
Part 2: Project and Report (70 points; due March 14 at midnight)
Once you have received feedback from me on your proposal, you should proceed to collect the data, set up your experiments, and evaluate the results.
You can either:
Prepare a short (5-minute) oral presentation (using the whiteboard, PowerPoint, or other means) describing your project and results, and write a 2-page (double-spaced) report with additional details (due at midnight), or
Skip the oral presentation and write a 5-page (double-spaced) report (due at midnight) summarizing your results.
There are a limited number of oral presentation slots, assigned in order of proposal submission time (you can also email me earlier to claim a slot). If the slots fill up, you will need to write the 5-page report.
The oral presentation should cover:
Problem description (1 minute): Based on your proposal and refined as recommended by my feedback on the proposal.
Data set description (30 seconds). Based on your proposal and refined as recommended by my feedback on the proposal.
Your choice of machine learning algorithms to solve the problem. Based on your proposal and refined as recommended by my feedback on the proposal.
Summary of experimental results (2-3 minutes). Report quantitative performance results, describe the baseline to which you compared, and include at least one figure to show graphically what performance you observed (see below for what kind of figures you might include).
Written reports (of either length) should include:
Standard assignment header: your name, the class name and number (CS 461, Machine Learning), the quarter (Winter 2009), the name of the assignment (Final Project Report), and the assignment due date (March 14, 2009).
Problem description: Copied from your proposal and refined as recommended by my feedback on the proposal.
Data set description. Copied from your proposal and refined as recommended by my feedback on the proposal.
Your choice of machine learning algorithms to solve the problem. Copied from your proposal and refined as recommended by my feedback on the proposal.
Summary of experimental results. Unlike the experimental plan in your proposal, this part should describe what you actually did (sometimes experiments change from the original plan!). You must include:
- What kind of evaluation you used (e.g., separate training and test data sets, cross-validation, manual examination of results, user study).
- A description of the results comparing your algorithms' performance to at least one baseline. Remember to indicate what metrics you are using to measure performance. Are your algorithm and the baseline statistically significantly different in their performance? Include a confusion matrix (for classification problems).
- At least one figure. Note: The figure does not count towards the page requirement. The figure you choose to present is up to you; select a figure that will best capture the results of your experiments. Here are some ideas:
- Classification:
- Plot of classification test accuracy versus training data set size (does more data improve performance?)
- Plot of classification test accuracy versus some parameter (such as k for k-NN, C for SVMs, γ for SVM RBF kernels, etc.)
- Bar plot showing test performance for different kernel types for SVMs
- Regression:
- Plot of test RMSE versus training data set size
- Bar plot showing test performance for different kernel types for SVMs
- Analysis of the results. Where did your algorithm perform well (which instances)? What kind of errors did it make? What can you conclude about the problem you were trying to solve? What did you learn?
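For the confusion matrix requested above, the following is one possible way to tabulate it by hand, assuming rows are true classes and columns are predicted classes. The labels and predictions are made up, and the function name `confusion_matrix` is just a placeholder; Weka prints an equivalent table in its classifier output.

```python
def confusion_matrix(y_true, y_pred, labels):
    """Count (true class, predicted class) pairs.

    Returns a list of rows; row i holds the instances whose true class
    is labels[i], and column j counts how many were predicted labels[j].
    """
    index = {label: i for i, label in enumerate(labels)}
    matrix = [[0] * len(labels) for _ in labels]
    for t, p in zip(y_true, y_pred):
        matrix[index[t]][index[p]] += 1
    return matrix


# Made-up test labels and predictions
labels = ["yes", "no"]
y_true = ["yes", "yes", "no", "no", "yes"]
y_pred = ["yes", "no", "no", "yes", "yes"]

cm = confusion_matrix(y_true, y_pred, labels)

# The diagonal holds the correct predictions, so accuracy can be read off it.
acc = sum(cm[i][i] for i in range(len(labels))) / len(y_true)

print("true \\ predicted:", labels)
for label, row in zip(labels, cm):
    print(label, row)
print("accuracy:", acc)
```

The off-diagonal cells are a good starting point for the analysis: they show which classes your algorithm confuses with which.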
Point breakdown for reports:
If you are doing an oral presentation, your score is a combination of your ability to communicate these items orally and in written form. Otherwise, your score comes entirely from your written report.
- Problem description: 10
- Data set description (and data set itself, if you created your own): 10
- Machine learning algorithm descriptions: 10
- Experimental results (includes results, analysis, and baseline): 30
- Figure: 10
Total: 70 points
Submit your project report in a file named <yourlastname>-report.txt (or <yourlastname>-report.pdf) to CSNS by midnight on March 14. If you need to submit your figures separately, upload them as PNG, PDF, JPG, or GIF.
You must also submit your data file, in ARFF format, to CSNS.