
Final Project Directions


Shortcuts: [CS691A Machine Learning]


Important Dates
 Project Proposal Due Date: 2:00pm on Thursday, November 17, 2016
 Project Due Date: 5:00pm on Friday, December 9, 2016


Proposal Requirements
Please provide a description of the project you intend to do.
It does not need to be lengthy; I would expect something around
a page. It should contain the title and the group members' names
and email addresses. Each group should submit a single write-up.
If your project is inspired by a paper, please cite it. Please
describe your starting point, meaning all the resources you
already have (e.g. source code and data), and describe what you
need to add to achieve your goals. If you are not going to
implement a paper entirely, indicate which part you are going to
implement. I will offer recommendations if you are attempting to
do too much or too little. Your proposal must contain a novelty
element in order to be approved, and it is your responsibility
to highlight it. It is fine to build on prior work, but you must
clearly state which portion of the project is novel and has not
yet been done.
Please send the proposal write-up via email by the deadline
indicated above. You will then be notified whether your proposal
is accepted as is or needs to be revised.


Project Requirements
You are required to choose a topic and develop a project based
on it. The usual way to do so is to pick a paper that you are
interested in and develop your proposal based on it. You could
choose among the papers we covered or mentioned in class, or
those referred to by the reading assignments. You could also
look for papers on your own in the machine learning conferences,
or in the computer vision, biometrics, big data, databases, or
data mining conferences. However, the requirement is that a
machine learning approach must be used in those papers. As
stated above, if a paper is very rich, you may want to implement
only portions of it, or a simplified version, to lay the
groundwork for your project. If you want, you can also expand
one of the assignments you already did, but you need to show
that you can make extensive improvements. In particular, your
project should contain a novelty element that you pursue as a
main goal. A good strategy for proposing something compelling is
to apply ideas developed in class, or from a machine learning
paper of your choice, to your own research (provided that it is
not done in a trivial way). You could also compare different
approaches (for instance, one approach developed in one of the
assignments with another one), or develop your own extensions to
a paper.
Projects are carried out in teams of two or three people. To
work alone, you will need to obtain permission from the
instructor. The expected effort scales with team size: a team of
two is expected to put in double the effort, and so on.


Deliverables
Report
You are required to prepare a project report. It should
describe the general approach and what you have done, explain
the performance criteria you used, discuss success cases and
failure cases along with the reasons for each, and explain the
new ideas or extensions you came up with.
Deliverables
You should deliver the project report, and the code by
the deadline indicated above. If possible, you should also
deliver the data.
The project will be graded out of 100 points. There will be no
extra points since you are defining the specifications of your
project. Obviously, the more work you do, the higher the
likelihood of securing an A. The members of a given team will
receive the same grade.
To turn in your assignment, you will need to deliver a zipped
file (use these guidelines to deliver
your files) containing the following:
[65 points] Code: Please supply all the code and the data you
used to generate your results (if data is not included please
provide a justification), along with a README file to allow me
to test it out.
[35 points] Report


Guidelines to Deliver your Project
You will need to turn in your assignment via Dropbox.
If you DO NOT have a Public folder, after you have created your
Dropbox account, please follow the instructions on this
page to create a shared link to your zip file. After creating
the link, please send it to the course instructor.
If you have a Public folder, after you have created your
Dropbox account, please do the following:
 Move the zip archive inside the Public folder, which is
located inside your Dropbox folder.
 Right-click on the zip file, open the Dropbox menu, and
select Copy Public Link. You can also consult these
instructions.
 Open your email client and start writing a new email, then
paste the Public Link to the zip file in the email body.
 Send the email to the course instructor.


Project Ideas Examples
Some of these project ideas were inspired by projects developed
at CMU by students taking a class similar to this one.
 Semi-supervised learning studies algorithms that learn from
a small amount of labeled data and a large pool of unlabeled
data. Semi-supervised learning algorithms typically make an
assumption about the data distribution. For example, several
algorithms assume that the decision boundary should not pass
through regions of high data density. When this assumption
is satisfied, the algorithms can outperform supervised
learning trained on the labeled data alone. The goal of this
project is to experiment with semi-supervised learning
algorithms on a data set of your choice. Some algorithms you
can consider using are described here, under the names of
co-training, self-training, transductive SVMs (S3VMs), or one
of the many graph-based algorithms. You may compare several
semi-supervised and supervised algorithms on your data set and
attempt to draw some general conclusions about semi-supervised
learning. The project can use any data set; the UC Irvine
Machine Learning Repository contains several choices.
 Low-rank matrix factorization has a variety of applications,
such as collaborative filtering and image completion. The goal
is to fill in the missing entries of a data matrix.
Traditional low-rank matrix factorization methods (e.g. the
SVD) minimize the L2 loss, i.e., the sum of squared residuals
between the known entries and the reconstruction given by a
product of two low-rank matrices. It is known, however, that
the L2 loss is sensitive to outliers. One remedy is to
replace the L2 loss with the L1 loss, which is known to be
more robust against outliers. In computer vision, two
approaches ([A] and [B]) have been proposed for low-rank
matrix factorization under the L1 loss, and both have
demonstrated promising results. One project could have the
goal of implementing one of the two algorithms and comparing
it against a state-of-the-art method that minimizes the L2
norm. Another option would be to implement both algorithms
and compare them against each other. A third project idea
would be to extend one of the two L1 algorithms by adding
regularization terms for the factors composing the matrix
factorization, and to compare the extension against the
original algorithm. The idea for each of these projects would
be to use the algorithms for collaborative filtering, for
which there is a widely used benchmark dataset, MovieLens.
 We have learned that parameter estimation for Hidden Markov
Models (HMMs) can be carried out with the Baum-Welch algorithm
(EM), which we know may converge to a local optimum. Newer
approaches, such as [C] and [D], can avoid this problem. The
goal of this project is to implement both of these algorithms
(code for [D] is available) and compare them against each
other or against the traditional Baum-Welch algorithm. The
comparison should be done on datasets different from those
used in the original papers. Examples of dynamic datasets
include brain wave signals, the CMU motion capture dataset,
the time series data of the UC Irvine Machine Learning
Repository, or any other dataset of your choice.
 The problem of information extraction from text is very
important. The goal of this project is to extract person names
from emails. You can download a dataset of emails from here.
You could start by looking at this paper [E] and compare it
with a newer approach. This can also be seen as a sequential
labeling problem, where an email is composed of a sequence of
tokens, and each token has a label of either "personname" or
"notapersonname".
 The Enron Email data set contains about 500,000 emails from
about 150 users. The goal of this project would be to
implement and compare a number of classification approaches
that attempt to identify the person that sent the email based
solely on the email body.
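
To make the last idea above more concrete, here is a minimal sketch of one possible baseline for identifying the sender from the email body: a bag-of-words multinomial Naive Bayes classifier with add-one smoothing, written with the Python standard library only. The function names, the whitespace tokenizer, and the toy emails and senders are all assumptions invented for illustration; none of them come from the Enron data set, and a real project would need proper preprocessing and a held-out evaluation.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase and split on whitespace (a deliberately crude tokenizer)."""
    return text.lower().split()


def train_naive_bayes(emails):
    """Collect counts from (sender, body) pairs.

    Returns per-sender word counts, per-sender email counts, and the vocabulary.
    """
    word_counts = defaultdict(Counter)  # sender -> word -> count
    email_counts = Counter()            # sender -> number of emails
    vocabulary = set()
    for sender, body in emails:
        tokens = tokenize(body)
        word_counts[sender].update(tokens)
        email_counts[sender] += 1
        vocabulary.update(tokens)
    return word_counts, email_counts, vocabulary


def predict_sender(body, model):
    """Return the sender with the highest posterior log-probability."""
    word_counts, email_counts, vocabulary = model
    total_emails = sum(email_counts.values())
    best_sender, best_score = None, -math.inf
    for sender in email_counts:
        # Log prior: fraction of training emails written by this sender.
        score = math.log(email_counts[sender] / total_emails)
        total_words = sum(word_counts[sender].values())
        for token in tokenize(body):
            if token not in vocabulary:
                continue  # ignore words never seen in training
            # Add-one (Laplace) smoothing avoids zero probabilities.
            count = word_counts[sender][token]
            score += math.log((count + 1) / (total_words + len(vocabulary)))
        if score > best_score:
            best_sender, best_score = sender, score
    return best_sender


# Hypothetical toy data, invented for illustration (NOT from the Enron corpus).
TRAIN = [
    ("alice", "please review the gas pipeline contract before friday"),
    ("alice", "the contract terms for the pipeline need legal review"),
    ("bob", "lunch today? the new taco place downtown is great"),
    ("bob", "great game last night, lunch tomorrow to celebrate"),
]

model = train_naive_bayes(TRAIN)
print(predict_sender("can legal review the contract", model))
print(predict_sender("taco lunch tomorrow", model))
```

A baseline like this could serve as the reference point against which the other classification approaches in a project are compared.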
