labs / machine-learning-lab.txt
COMP 210 Labs: New for 2022

Machine Learning and Information Assurance:

	R or Python? Both are available in the labenv. R may be a bit friendlier/simpler for beginners.
	Loading and understanding a data set
	Exploratory Data Analysis

		Performing basic descriptive statistics
			Typical values: mean, median, mode
			Standard deviation
		Basic plotting (maybe provide some helper functions for common cases)
		Correlation matrix
	Missing values?
	General machine learning background
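	Possible starting point for the descriptive-statistics helpers under Exploratory Data Analysis. This is only a sketch in Python using the standard library; the column names and values are made-up toy data, and `describe`, `pearson` and `correlation_matrix` are hypothetical helper names. An R version would presumably lean on built-ins instead.

```python
import statistics

# Hypothetical toy data: message length and number of digits per message.
columns = {
    "length": [20, 35, 35, 120, 150, 160],
    "digits": [0, 1, 0, 8, 11, 12],
}

def describe(values):
    """Return typical values and spread for one numeric column."""
    return {
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "mode": statistics.mode(values),
        "stdev": statistics.stdev(values),  # sample standard deviation
    }

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length columns."""
    mean_x, mean_y = statistics.mean(xs), statistics.mean(ys)
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sum((x - mean_x) ** 2 for x in xs) ** 0.5
    spread_y = sum((y - mean_y) ** 2 for y in ys) ** 0.5
    return covariance / (spread_x * spread_y)

def correlation_matrix(cols):
    """Nested dict of pairwise Pearson correlations between all columns."""
    names = list(cols)
    return {a: {b: pearson(cols[a], cols[b]) for b in names} for a in names}

summary = {name: describe(values) for name, values in columns.items()}
corr = correlation_matrix(columns)

for name, stats in summary.items():
    print(name, stats)
print(corr)
```

	Note that the mean and median of `length` differ quite a bit here, which could be a useful prompt for discussing skewed data.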

	Wireshark analysis?
		stateful analysis?

	[ ] Diagram: training and testing division, general ML pipelines
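	To go with the diagram, the training/testing division could be shown in a few lines. A minimal sketch, assuming a simple shuffled hold-out split; the 25% test fraction, the seed and the toy rows are arbitrary choices for illustration, and `train_test_split` here is a hand-written helper, not a library function.

```python
import random

def train_test_split(rows, test_fraction=0.25, seed=42):
    """Shuffle a copy of rows and split it into (train, test) lists."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # fixed seed keeps the split repeatable
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]

# Hypothetical labelled rows: (message text, class).
rows = [(f"message {i}", "spam" if i % 4 == 0 else "ham") for i in range(20)]

train, test = train_test_split(rows)
print(len(train), "training rows,", len(test), "testing rows")
```

	The key point for students: the model never sees the testing rows during training, so performance on them is a fairer estimate of how it will behave on new messages.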

	POPFile? Maybe tricky: it's old, the official Web site is down/gone, and the proxy mode could be a pain to work with in the labenv (have to run a mail client). Also, training manually on e-mails could be a nuisance.
		Instead, maybe do Naive Bayes classification within R

		Might be good to do that first. Introduce confusion matrix concept too. Have students consider the relative importance of incorrect classification, esp. "ham" as "spam"
		Could even use one of the SMS datasets - would be a bit simpler and easier and clearer, no?
			good diagrams there - use those
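		Rough idea of what the Naive Bayes plus confusion matrix exercise could look like. This is a sketch only: a hand-rolled multinomial Naive Bayes with add-one smoothing, written in Python with the standard library, trained on made-up messages rather than a real SMS dataset. The function names are hypothetical; an R version would more likely use an existing package.

```python
import math
from collections import Counter, defaultdict

def tokenise(text):
    """Simplest possible tokeniser: lower-case and split on whitespace."""
    return text.lower().split()

def train(messages):
    """Count classes and per-class words from (label, text) pairs."""
    class_counts = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for label, text in messages:
        class_counts[label] += 1
        for word in tokenise(text):
            word_counts[label][word] += 1
            vocab.add(word)
    return {"classes": class_counts, "words": word_counts, "vocab": vocab}

def classify(model, text):
    """Return the label with the highest log-probability score."""
    total = sum(model["classes"].values())
    vocab_size = len(model["vocab"])
    scores = {}
    for label, n in model["classes"].items():
        counts = model["words"][label]
        denominator = sum(counts.values()) + vocab_size
        score = math.log(n / total)  # class prior
        for word in tokenise(text):
            if word in model["vocab"]:  # ignore words never seen in training
                score += math.log((counts[word] + 1) / denominator)  # add-one smoothing
        scores[label] = score
    return max(scores, key=scores.get)

def confusion_matrix(model, messages):
    """Count (actual, predicted) pairs over labelled messages."""
    matrix = Counter()
    for actual, text in messages:
        matrix[(actual, classify(model, text))] += 1
    return matrix

# Made-up training and held-out messages, for illustration only.
training = [
    ("spam", "win a free prize now"),
    ("spam", "free entry claim your prize"),
    ("spam", "urgent call now to claim cash"),
    ("ham", "are we still meeting for lunch"),
    ("ham", "call me when you get home"),
    ("ham", "see you at the lab later"),
]
held_out = [
    ("spam", "claim your free cash prize"),
    ("ham", "see you at lunch"),
]

model = train(training)
matrix = confusion_matrix(model, held_out)
for (actual, predicted), count in sorted(matrix.items()):
    print(f"actual={actual} predicted={predicted}: {count}")
```

		The `("ham", "spam")` cell is the one to dwell on: it counts legitimate messages wrongly classified as spam, which is usually the costlier mistake.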
		Link to Paul Graham's articles:

		Exercises (commented R script I think rather than an R Markdown file, for simplicity)
		 - Extract which characters are present in the messages
		[ ] Questions:
		 - How to tokenise? Whether to use word stems or just leave as-is? What about case-sensitivity?
			Maybe have them try different preprocessing and see whether it actually helps. Things like SHOUTING are often red flags for spam, and casual SMS use often doesn't involve conventional capitalisation, punctuation or spelling.
		 - What limitations might there be in training a general-purpose spam-filtering e-mail classifier from a public dataset? Or, what advantages might there be to training your own classifier on your own messages?
		 - The SMS data set is very simple, containing only the message and the class (ham/spam). What other information (metadata) would normally be available with SMS text messages? Would there be any advantage to including those in the classifier as well?
			timestamp of message
			phone number of sender
			phone number of recipient
		 - Imagine you wanted to classify e-mail messages instead of SMS messages. What further metadata would generally be available for e-mails? What preprocessing might be necessary/appropriate?
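		For the character-extraction exercise and the tokenising/case-sensitivity question, something like the following could be handed out as a starting point. A sketch under assumptions: the example messages are invented, the helper names are hypothetical, and the "shouting ratio" is just one possible way to keep capitalisation as a signal once text has been lower-cased. No stemming is attempted.

```python
import re
from collections import Counter

def characters_present(messages):
    """Count every character that occurs across all messages."""
    return Counter(ch for text in messages for ch in text)

def tokenise(text, lowercase=True, strip_punctuation=True):
    """Split text into tokens, with optional preprocessing steps."""
    if lowercase:
        text = text.lower()
    if strip_punctuation:
        # Replace anything that is not a word character or whitespace.
        text = re.sub(r"[^\w\s]", " ", text)
    return text.split()

def shouting_ratio(text):
    """Fraction of letters that are upper-case (0.0 if there are no letters)."""
    letters = [ch for ch in text if ch.isalpha()]
    if not letters:
        return 0.0
    return sum(ch.isupper() for ch in letters) / len(letters)

# Invented example messages.
messages = ["FREE entry!! Call NOW", "ok c u l8r"]

print(sorted(characters_present(messages)))
for text in messages:
    print(tokenise(text), tokenise(text, lowercase=False, strip_punctuation=False))
    print(shouting_ratio(text))
```

		Students could then retrain with each tokeniser setting and compare the resulting confusion matrices to see whether the preprocessing actually helps.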


Software Engineering and Information Assurance:

	Risk assessment modelling/calculations
	Threat models
	Code reviews
	Source code management
		Git, GitHub, GitBucket (do we need them to create user accounts?)
		git blame, git bisect?