CS 461: Machine Learning
Instructor: Kiri Wagstaff

CS 461 Final Project

This project provides you with a chance to investigate a machine learning problem of personal interest. You'll select or create a data set, select a machine learning algorithm, and evaluate that algorithm's ability to solve the problem.

This project is worth 100 points and counts for 25% of your final grade.


Part 1: Proposal (30 points; due 1/19 at midnight)

Before you begin work on the project itself, you'll want to plan out your goals and how you will approach the problem. To that end, you will write a project proposal and submit it for comment and feedback.

I encourage you to discuss your ideas with me ahead of time, in office hours or by email.

The proposal must be submitted in text or PDF format and must contain the following items, in this order:

  1. Standard assignment header: your name, the class name and number (CS 461, Machine Learning), the quarter (Winter 2008), the name of the assignment (Project Proposal), and the assignment due date (January 19, 2008).

  2. Problem description (1 paragraph). Provide enough detail that someone with no experience in the problem area can still see what the learning goal is. The problem could be taken from your daily life, a personal hobby, or be completely fictional. (You may find it easier, however, if it involves a real situation that you have experience with.) It should be a supervised problem that requires classification (predict a class label) or regression (predict a real-valued quantity).

  3. Data set description (1 paragraph). Your data set must include at least 100 observations (instances).

    • Indicate the source of the data set, if you are using an existing data set, or note that you will be collecting the data yourself. There are links to standard data sets available on the "Links" page of the course website. If you create your own data set, you must store it in a standard format: a text file consisting of one instance per line, with feature values separated by whitespace or commas. If you have questions about this format, let me know.
    • Include a list of the features you decide to use to represent the data.
    • For each feature, specify its type (numeric or symbolic) and range of values.
  4. Your choice of machine learning algorithm to solve the problem. Choose one of: k-nearest neighbors, decision trees, support vector machines, neural networks, or naive Bayes.

    Include a justification for why you think this algorithm will help solve the problem you described in your Problem Description; consider space and time efficiency in making your choice. Also indicate how you will obtain an implementation of the algorithm to run your experiments: use an implementation provided by Weka, implement the algorithm yourself, or obtain one from some other source. (1 paragraph total)

  5. Your experimental plan (1 paragraph).

    • What kind of evaluation will you use (e.g., separate training and test data sets, cross-validation, manual examination of results, user study)?
    • What quantitative metrics will you use to evaluate performance? For example, you might use classification accuracy or error rate for a classification problem, or RMSE (root mean squared error) for a regression problem. See Alpaydin, Chapter 14.
    • Include at least one baseline to which your algorithm's performance can be compared. This should be a simple approach to the problem, such as predicting the most common class (for classification) or the mean output value (for regression).
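
To make the data file format, the metrics, and the baseline idea concrete, here is a minimal Python sketch. It is only an illustration under stated assumptions, not a required implementation: the function names are hypothetical, the tiny data sets are invented, and it assumes the value to predict is the last item on each line (the format described above does not say where the label goes).

```python
import math
from collections import Counter


def parse_instances(text):
    """Parse one instance per line; values are separated by commas and/or whitespace.

    Assumption (not from the handout): the last value on each line is the target.
    """
    rows = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        rows.append(line.replace(",", " ").split())
    return rows


def majority_class_baseline(train_labels):
    """Return the most common class label in the training data."""
    return Counter(train_labels).most_common(1)[0][0]


def accuracy(true_labels, predicted_labels):
    """Fraction of instances whose predicted label matches the true label."""
    correct = sum(t == p for t, p in zip(true_labels, predicted_labels))
    return correct / len(true_labels)


def rmse(true_values, predicted_values):
    """Root mean squared error between true and predicted real values."""
    squared = [(t - p) ** 2 for t, p in zip(true_values, predicted_values)]
    return math.sqrt(sum(squared) / len(squared))


# Invented classification data: two numeric features, then a class label.
train_text = """
5.1, 3.5, yes
4.9 3.0 no
6.2, 2.9, yes
5.9 3.2 yes
"""
test_text = """
5.0, 3.4, yes
6.0, 3.0, no
"""
train_labels = [row[-1] for row in parse_instances(train_text)]
test_labels = [row[-1] for row in parse_instances(test_text)]

# Classification baseline: always predict the most common training class.
guess = majority_class_baseline(train_labels)
baseline_accuracy = accuracy(test_labels, [guess] * len(test_labels))

# Regression baseline: always predict the mean training output (invented values).
train_targets = [2.0, 4.0, 6.0]
test_targets = [3.0, 5.0]
mean_guess = sum(train_targets) / len(train_targets)
baseline_rmse = rmse(test_targets, [mean_guess] * len(test_targets))

print(guess, baseline_accuracy, baseline_rmse)
```

On this invented data the majority class is "yes", so the classification baseline gets 0.5 accuracy on the two test instances, and the mean-output baseline has an RMSE of 1.0. These are the kinds of numbers your chosen algorithm's results would be compared against.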

Submit your project proposal in a file named <yourlastname>-proposal.txt (or <yourlastname>-proposal.pdf) to CSNS by midnight on January 19.


Part 2: Project (70 points; due March 8)

Once you have received feedback from me on your proposal, you should proceed to collect the data, set up your experiments, and evaluate the results.

You can either:

  1. Prepare a short (5-minute) oral presentation (using the whiteboard, PowerPoint, or other means) describing your project and results, and write a 2-page (double-spaced) report with additional details (due at midnight), or

  2. Skip the oral presentation and write a 4-page (double-spaced) report (due at midnight) summarizing your results.

There is a limited number of oral presentation slots; if they fill up, you will need to write the 4-page report.

The oral presentation should cover:

  1. Problem description (1 minute): Based on your proposal and refined as recommended by my feedback on the proposal.

  2. Data set description (30 seconds). Based on your proposal and refined as recommended by my feedback on the proposal.

  3. Your choice of machine learning algorithm to solve the problem and its justification (30 seconds). Based on your proposal and refined as recommended by my feedback on the proposal.

  4. Summary of experimental results (2-3 minutes). Report quantitative performance results, describe the baseline to which you compared, and include at least one figure to show graphically what performance you observed (see below for what kind of figures you might include).

Reports (of either length) should include:

  1. Standard assignment header: your name, the class name and number (CS 461, Machine Learning), the quarter (Winter 2008), the name of the assignment (Final Project Report), and the assignment due date (March 8, 2008).

  2. Problem description: Copied from your proposal and refined as recommended by my feedback on the proposal.

  3. Data set description. Copied from your proposal and refined as recommended by my feedback on the proposal.

  4. Your choice of machine learning algorithm to solve the problem and its justification. Copied from your proposal and refined as recommended by my feedback on the proposal.

  5. Summary of experimental results. Unlike section 5 (the experimental plan) in your proposal, this part should describe what you actually did (sometimes experiments change from the original plan!). You must include:

    1. What kind of evaluation you used (e.g., separate training and test data sets, cross-validation, manual examination of results, user study).
    2. A description of the results comparing your algorithm's performance to at least one baseline. Remember to indicate what metrics you are using to measure performance. Are your algorithm and the baseline statistically significantly different in their performance? Include a confusion matrix (for classification problems).
    3. At least one figure. Note: The figure does not count towards the page requirement. The figure you choose to present is up to you; select a figure that will best capture the results of your experiments. Here are some ideas:
      • Classification:
        • Plot of classification accuracy versus training data set size (does more data improve performance?)
        • For two-class problems: ROC curve to show performance as a parameter is varied (such as k for k-NN, C for SVMs, training set size, etc.)
        • Bar plot showing performance for different kernel types for SVMs
      • Regression:
        • Plot of RMSE versus training data set size
        • Bar plot showing performance for different kernel types for SVMs
    4. Analysis of the results. Where did your algorithm perform well (which instances)? What kind of errors did it make? What can you conclude about the problem you were trying to solve?
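
As one possible way to produce the confusion matrix and to check whether your algorithm and the baseline differ significantly, here is a minimal Python sketch. Treat it as an illustration under assumptions rather than the required method: the function names are hypothetical, the labels and predictions are invented, and the exact McNemar test is just one reasonable choice of significance test for comparing two classifiers on the same test set (this handout does not prescribe a particular test).

```python
from collections import Counter
from math import comb


def confusion_matrix(true_labels, predicted_labels):
    """Count how often each (true label, predicted label) pair occurs."""
    return Counter(zip(true_labels, predicted_labels))


def mcnemar_exact_p(true_labels, preds_a, preds_b):
    """Two-sided exact McNemar test for two classifiers on the same test set.

    Only instances where exactly one classifier is correct are informative.
    Under the null hypothesis, each such instance is equally likely to
    favor either classifier, so the count follows Binomial(n, 0.5).
    """
    triples = list(zip(true_labels, preds_a, preds_b))
    a_only = sum(a == t and b != t for t, a, b in triples)
    b_only = sum(a != t and b == t for t, a, b in triples)
    n = a_only + b_only
    if n == 0:
        return 1.0  # the classifiers never disagree in correctness
    k = min(a_only, b_only)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


# Invented two-class test set with ten instances.
truth = ["pos", "pos", "pos", "neg", "neg", "neg", "pos", "neg", "pos", "neg"]
model = ["pos", "pos", "pos", "neg", "neg", "neg", "pos", "neg", "neg", "neg"]
baseline = ["pos"] * len(truth)  # hypothetical baseline: always predict "pos"

cm = confusion_matrix(truth, model)
for true_label in ("pos", "neg"):
    row = [cm[(true_label, predicted)] for predicted in ("pos", "neg")]
    print(true_label, row)

p_value = mcnemar_exact_p(truth, model, baseline)
print("p =", p_value)
```

On this invented data the model makes one error (a "pos" instance predicted as "neg") while the baseline makes five, yet the test gives p = 0.21875. With only ten test instances, even a difference that large is not significant at the 0.05 level, which is one reason a larger test set or cross-validation gives more trustworthy comparisons.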

Point breakdown for reports:

If you are doing an oral presentation, your score is a combination of your ability to communicate these items orally and in written form. Otherwise, your score comes entirely from your written report.

Total: 70 points

Submit your project report in a file named <yourlastname>-report.txt (or <yourlastname>-report.pdf) to CSNS by midnight on March 8.