Final Project Directions


Shortcuts: [CS691A Machine Learning]

Important Dates

  • Project Proposal Due Date: 2:00pm on Thursday, November 17, 2016
  • Project Due Date: 5:00pm on Friday, December 9, 2016

Proposal Requirements

Please provide a description of the project you intend to do. It does not need to be lengthy; I would expect something around a page. It should contain the title and the group members' names and email addresses. Each group should submit a single write-up. If your project is inspired by a paper, please cite it. Please describe your starting point, meaning all the resources you already have (e.g. source code and data), and what you need to add to achieve your goals. If you are not going to implement a paper entirely, indicate which part you are going to take care of. I will offer recommendations if you are attempting to do too much or too little. Your proposal must contain a novelty element in order to be approved, and it is your responsibility to highlight it. It is therefore OK to build on prior work, but you must clearly state which portion of the project is novel and has not yet been done.

Please send the proposal write-up via email by the deadline indicated above. You will then be notified whether your proposal is accepted as is or needs to be revised.

Project Requirements

You are required to choose a topic and develop a project based on it. The usual way to do so is to pick a paper that you are interested in and develop your proposal based on it. You could choose among the papers we covered or mentioned in class, or those referred to by the reading assignments. You could also look for papers on your own in the machine learning conferences, or in the computer vision, biometrics, big data, databases, or data mining conferences. The requirement, however, is that those papers use a machine learning approach. As stated above, if a paper is very rich, you may want to implement only portions of it, or a simplified version, as the groundwork for your project. If you want, you can also expand one of the assignments you already did, but you need to show that you can make extensive improvements. In particular, your project should contain a novelty element that you pursue as a main goal.

A good strategy for proposing something compelling is to apply ideas developed in class, or from a machine learning paper of your choice, to your own research (provided that it is not done in a trivial way). You could also compare different approaches (for instance, one approach developed in one of the assignments with another one), or develop your own extensions to a paper.

Projects are carried out in teams of two or three people. A project with only one student requires permission from the instructor. Obviously, a team of two is expected to put in twice the effort of a single student, and so on.


You are required to prepare a project report. It should describe the general approach and what you have done, explain the performance criteria you used, discuss success cases and failure cases together with the reasons for them, and explain the new ideas or extensions you came up with.

You should deliver the project report and the code by the deadline indicated above. If possible, you should also deliver the data.

The project will be graded out of 100 points. There will be no extra points, since you are defining the specifications of your project. Obviously, the more work you do, the higher the likelihood of securing an A. The members of a given team will receive the same grade.

To turn in your assignment, you will need to deliver a zipped file (use these guidelines to deliver your files) containing the following:

[65 points] Code: Please supply all the code and the data you used to generate your results (if data is not included please provide a justification), along with a README file to allow me to test it out.

[35 points] Report

Guidelines to Deliver your Project

You will need to turn in your assignment via Dropbox.

If you DO NOT have a Public folder, after you have created your Dropbox account, please follow the instructions on this page to create a shared link to your zip file. After creating the link, please send it to the course instructor.

If you have a Public folder, after you have created your Dropbox account, please do the following:

  • Move the zip archive inside the Public folder, which is located inside your Dropbox folder.
  • Right-click on the zip file, go inside the Dropbox menu, and select Copy Public Link. You can also consult these instructions.
  • Open your email client and start writing a new email, then paste the Public Link to the zip file in the email body.
  • Send the email to the course instructor.

Project Ideas Examples

Some of these project ideas have been inspired by projects developed at CMU by students taking a class similar to this one.
  • Semi-supervised learning studies algorithms that learn from a small amount of labeled data and a large pool of unlabeled data. Semi-supervised learning algorithms typically make an assumption about the data distribution. For example, several algorithms assume that the decision boundary should not pass through regions with high data density. When this assumption is satisfied, the algorithms can perform better than purely supervised learning. The goal of this project is to experiment with semi-supervised learning algorithms on a data set of your choice. Some algorithms you can consider using are described here, under the names of co-training, self-training, transductive SVMs (S3VMs), or one of the many graph-based algorithms. You may compare several semi-supervised and supervised algorithms on your data set, and attempt to draw some general conclusions about semi-supervised learning. The project can use any data set. The UC Irvine Machine Learning Repository contains several choices.
  • Low-rank matrix factorization has a variety of applications, such as collaborative filtering and image completion. The goal is to fill in the missing entries of a data matrix. Traditional low-rank matrix factorization methods (e.g. the SVD) minimize the L2 loss, i.e., the sum of squared residuals between the known entries and the reconstruction given by a product of two low-rank matrices. It is known, however, that the L2 loss is sensitive to outliers. One remedy is to replace the L2 loss with the L1 loss, which is known to be more robust against outliers. In computer vision, two approaches ([A] and [B]) have been proposed for low-rank matrix factorization under the L1 loss, and both have demonstrated promising results. One project could have the goal of implementing one of the two algorithms and comparing it against a state-of-the-art method that minimizes the L2 loss. Another option would be to implement both algorithms and compare them against each other. A third project idea would be to extend one of the two L1 algorithms by adding regularization terms for the factors composing the matrix factorization, and compare the extension against the original algorithm. The idea for each of those projects would be to use the algorithms for collaborative filtering, for which there is a widely used benchmark dataset, MovieLens.
  • We have learned that parameter estimation for Hidden Markov Models (HMMs) can be carried out with the Baum-Welch algorithm (EM), which we know may converge to a local optimum. Newer approaches, such as [C] and [D], can avoid this problem. The goal of this project is to implement both of these algorithms (code for [D] is available), and compare them against each other, or against the traditional Baum-Welch algorithm. The comparison should be done on datasets different from those used in the original papers. Dynamic dataset examples include brain wave signals, the CMU motion capture dataset, the time series data of the UC Irvine Machine Learning Repository, or any other dataset of your choice.
  • The problem of information extraction from text is very important. The goal of this project is to extract person names from an email. You can download a dataset of emails from here. You could start by looking at this paper [E], and compare it with a newer approach. This can also be seen as a sequence labeling problem, where an email is composed of a sequence of tokens, and each token has a label of either "person-name" or "not-a-person-name".
  • The Enron E-mail data set contains about 500,000 emails from about 150 users. The goal of this project would be to implement and compare a number of classification approaches that attempt to identify the person who sent an email based solely on the email body.
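
As a small illustration for the low-rank matrix factorization idea above, the following sketch computes the L2 and L1 losses over the known entries of a data matrix, and shows how a single outlier affects each. It is only a toy example under stated assumptions: the function names, the use of None to mark missing entries, and the tiny rank-1 matrices are all hypothetical choices made for illustration, and it does not implement the algorithms of [A] or [B].

```python
# Toy sketch: L2 vs. L1 loss over the known entries of a data matrix.
# All names and the None-for-missing convention are illustrative assumptions.

def reconstruct(U, V):
    """Return the product U V^T for factor matrices given as lists of rows."""
    return [[sum(u_k * v_k for u_k, v_k in zip(u, v)) for v in V] for u in U]


def masked_losses(M, U, V):
    """Return (L2 loss, L1 loss) computed over the known entries of M only."""
    R = reconstruct(U, V)
    l2 = 0.0
    l1 = 0.0
    for i, row in enumerate(M):
        for j, m_ij in enumerate(row):
            if m_ij is None:  # missing entry: contributes to neither loss
                continue
            residual = m_ij - R[i][j]
            l2 += residual ** 2   # sum of squared residuals
            l1 += abs(residual)   # sum of absolute residuals
    return l2, l1


# Rank-1 factors: U is 3x1 and V is 2x1, so U V^T is 3x2.
U = [[1.0], [2.0], [3.0]]
V = [[1.0], [2.0]]

# Observations that match U V^T exactly, with one entry missing.
clean = [[1.0, 2.0], [2.0, None], [3.0, 6.0]]
# The same observations with one entry corrupted by an outlier (+10).
noisy = [[1.0, 2.0], [2.0, None], [13.0, 6.0]]

clean_l2, clean_l1 = masked_losses(clean, U, V)
noisy_l2, noisy_l1 = masked_losses(noisy, U, V)
print("clean:", clean_l2, clean_l1)
print("noisy:", noisy_l2, noisy_l1)
```

The outlier's contribution grows quadratically under the L2 loss and only linearly under the L1 loss, which is the intuition behind the robustness claim. An actual project would minimize these losses over U and V rather than just evaluate them.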