Assignment 2

The assignment

DATASETS

3: Spam data

The Data

This data set comes from a collection of 5000 personal email messages, 1000 of which are used for the training set and 4000 for the test set.

Each message was reduced to 185 binary features. The text strings associated with these 185 features are included in the "feature_names" variable. Each message is thus represented by a vector of 185 binary values, i.e., a row in the "data_train" and "data_test" matrices. Your goal is to learn two classifiers, each of which takes a 185-vector and returns a class label. One classifier will use Logistic Regression, and one will use Naïve Bayes.

The "labels_test" and "labels_train" data sets are binary vectors indicating which of the emails are spam and which are ham. We'll leave it to you to figure out whether 0 or 1 indicates "spam." (Feel free to discuss whether it's 0 or 1 on the online bulletin board.)

Thanks to Sam Roweis for providing this data.

MATLAB hints:

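You can use the "find" command to separate the training sets, e.g., data_train(find(labels_train==0),:) will give you all the data from class 0. Note that the comparison goes inside the call to "find"; writing find(labels_train)==0 compares the returned indices to 0 and selects nothing. Logical indexing without "find", i.e., data_train(labels_train==0,:), gives the same result.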
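As a minimal sketch of the hint above, the following splits a toy training set by class. The small matrices here are made-up stand-ins for "data_train" and "labels_train", and the orientation of the label vector (a column, one entry per message) is an assumption.

```matlab
% Toy stand-ins for the assignment variables (hypothetical values).
data_train   = [1 0 1; 0 0 1; 1 1 0; 0 1 0];   % 4 messages, 3 binary features
labels_train = [0; 1; 0; 1];                    % one label per message

% Logical indexing: the comparison produces a mask that selects rows.
class0 = data_train(labels_train == 0, :);
class1 = data_train(labels_train == 1, :);

% Equivalent form with find, which converts the mask to row indices.
class0_via_find = data_train(find(labels_train == 0), :);

disp(size(class0));                        % [rows in class 0, number of features]
disp(isequal(class0, class0_via_find));    % the two forms agree
```

The logical-indexing form is usually preferred because it skips the intermediate index list, but both select the same rows.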

You can use the 'sort' command to find the highest and lowest weights: its second output gives the indices of the sorted entries, which you can then use to look up the corresponding entries in the list of feature names.
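A short sketch of the 'sort' hint, using invented weights and placeholder names. It assumes "feature_names" is a cell array of strings; if it is stored differently (for example, as a character matrix), the indexing in the last two lines would need to change.

```matlab
% Hypothetical weights and names; in the assignment these would be the
% learned weight vector and the provided feature_names.
w = [0.3; -1.2; 2.5; 0.0; -0.4];
feature_names = {'word_a', 'word_b', 'word_c', 'word_d', 'word_e'};

% idx(k) is the position in w of the k-th largest weight.
[w_sorted, idx] = sort(w, 'descend');

k = 2;                                               % how many to report
top_names    = feature_names(idx(1:k));              % highest weights first
bottom_names = feature_names(idx(end:-1:end-k+1));   % lowest weights first

disp(top_names);
disp(bottom_names);
```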

5: PCA data

Here are two mocap datasets (they will be shown in class later):

You should "center" the data using this function: centerpose.

Here is a function you can use to load a data file: loadmotion.m.

A viewer that can be used to show the data is Mosey; you are not expected to figure out how to use it, but you're welcome to try it if you like. If you would like to experiment with saving data to a file, you can use this function: savemotion.m. You can also visualize marker positions in MATLAB directly with the "plot3" command.

The assignment asks for a 2D plot of the low-dimensional coordinates. Here is an example of what such a plot will look like for the walking data:

It is also quite interesting to plot a 3D PCA projection and rotate the view.
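The 2D and 3D projections described above might be produced along the following lines. This is only a sketch on synthetic data: the layout (one row per frame, one column per marker coordinate) is an assumption about the motion files, and the mean subtraction shown is the usual PCA centering step, which is not necessarily what centerpose does.

```matlab
% Synthetic stand-in for a motion matrix: 200 frames, 30 coordinates.
% A 3-dimensional periodic signal is mixed into 30 columns.
t = linspace(0, 4*pi, 200)';
A = reshape(sin(1:90), 3, 30);                  % fixed mixing matrix
X = [sin(t), cos(t), 0.5*sin(2*t)] * A;

% Subtract the mean frame so PCA describes variation about the average.
Xc = X - repmat(mean(X, 1), size(X, 1), 1);

% Economy-size SVD: the columns of V are the principal directions.
[U, S, V] = svd(Xc, 'econ');

Y2 = Xc * V(:, 1:2);   % 2D coordinates, one row per frame
Y3 = Xc * V(:, 1:3);   % 3D coordinates, one row per frame

figure; plot(Y2(:, 1), Y2(:, 2), '.-');
xlabel('PC 1'); ylabel('PC 2');

figure; plot3(Y3(:, 1), Y3(:, 2), Y3(:, 3), '.-');
grid on; rotate3d on;  % drag with the mouse to rotate the view
```

Using the SVD of the centered data avoids forming the covariance matrix explicitly; taking the leading eigenvectors of the covariance matrix would give the same directions up to sign.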