COMP 210 Labs: New for 2022

Machine Learning and Information Assurance:

	R or Python? Both are available in the labenv. R may be a bit friendlier/simpler for beginners.
	
	Loading and understanding a data set
		
	Exploratory Data Analysis

		Performing basic descriptive statistics
		
			Typical values: mean, median, mode
			
			Standard deviation
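
			Rough sketch of the descriptive statistics above (Python here, since it is also in the labenv; the same is easy in R). The sample values are made up for illustration:

```python
import statistics

# Hypothetical sample: message lengths in characters (made-up values).
lengths = [23, 45, 45, 61, 160, 38, 45, 52]

summary = {
    "mean": statistics.mean(lengths),
    "median": statistics.median(lengths),
    "mode": statistics.mode(lengths),
    "stdev": statistics.stdev(lengths),  # sample standard deviation (n - 1)
}

for name, value in summary.items():
    print(f"{name}: {value:.2f}")
```

			The single long message pulls the mean well above the median, which could be a useful talking point before the outliers topic.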
			
		
		Basic plotting (maybe provide some helper functions for common cases)
		
		Correlation matrix
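
		Possible helper for the correlation matrix, as a minimal sketch using Pearson correlation and only the standard library. The column names and values are hypothetical, and constant columns (zero variance) are not handled:

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

def correlation_matrix(columns):
    """Map each pair of column names to their Pearson correlation."""
    names = list(columns)
    return {a: {b: pearson(columns[a], columns[b]) for b in names} for a in names}

# Hypothetical columns (made-up values).
data = {
    "length": [20, 40, 60, 80, 100],
    "digits": [1, 2, 3, 4, 5],
    "capitals": [9, 7, 5, 3, 1],
}
matrix = correlation_matrix(data)
for name, row in matrix.items():
    print(name, [round(row[other], 2) for other in data])
```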
	
	Missing values?
	
	Outliers?
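
	One way the missing-values and outliers steps might look, as a sketch only. It assumes missing entries are represented as None and uses the common 1.5 x IQR rule for outliers; both are assumptions, and the data are made up:

```python
import statistics

# Hypothetical column with gaps (None) and one extreme value.
raw = [12, 15, None, 14, 13, 200, None, 16, 15, 14]

# Missing values: count them, then work with what is present.
present = [v for v in raw if v is not None]
missing_count = len(raw) - len(present)

# Outliers: flag anything more than 1.5 * IQR outside the quartiles.
q1, _, q3 = statistics.quantiles(present, n=4)
iqr = q3 - q1
low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = [v for v in present if v < low or v > high]

print("missing:", missing_count)
print("outliers:", outliers)
```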
	
	
	General machine learning background
	

	Wireshark analysis?
		https://community.rti.com/static/documentation/wireshark/2020-07/doc/examples.html
		stateful analysis?


	http://kdd.ics.uci.edu/databases/kddcup99/kddcup99.html


	[ ] Diagram: training and testing division, general ML pipelines
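
	To go with the diagram, a minimal sketch of the training/testing division. The function name, the 80/20 split and the fixed seed are illustrative choices, and the rows are made up:

```python
import random

def train_test_split(rows, test_fraction=0.2, seed=42):
    """Shuffle a copy of rows and split it into (train, test)."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # fixed seed keeps the split repeatable
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]

# Hypothetical labelled messages.
rows = [(f"message {i}", "spam" if i % 5 == 0 else "ham") for i in range(50)]
train, test = train_test_split(rows)
print(len(train), "training rows,", len(test), "testing rows")
```

	The test rows must stay untouched until evaluation; only the training rows are used to fit the classifier.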

	POPFile? Maybe tricky: it's old, the official Web site is down/gone, and the proxy mode could be a pain to work with in the labenv (have to run a mail client). Also, training manually on e-mails could be a nuisance.
	
		Instead, maybe do Naive Bayes classification within R

		Might be good to do that first. Introduce confusion matrix concept too. Have students consider the relative importance of incorrect classification, esp. "ham" as "spam"
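
		Sketch of the confusion matrix idea, with made-up labels and predictions. Treating "spam" as the positive class is an assumption here, so a false positive is ham flagged as spam:

```python
from collections import Counter

def confusion_matrix(actual, predicted):
    """Count how often each (actual, predicted) label pair occurs."""
    return Counter(zip(actual, predicted))

# Hypothetical true labels and classifier output.
actual = ["ham", "ham", "spam", "spam", "ham", "spam", "ham", "ham"]
predicted = ["ham", "spam", "spam", "ham", "ham", "spam", "ham", "ham"]

cm = confusion_matrix(actual, predicted)
labels = ["ham", "spam"]

print("actual \\ predicted", *labels)
for a in labels:
    print(a, *[cm[(a, p)] for p in labels])

false_positives = cm[("ham", "spam")]  # ham wrongly flagged as spam
false_negatives = cm[("spam", "ham")]  # spam that slipped through
accuracy = sum(cm[(label, label)] for label in labels) / len(actual)
print("accuracy:", accuracy)
```

		Accuracy alone hides which kind of error occurred, which is why the two off-diagonal counts are pulled out separately.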
		
		Could even use one of the SMS datasets - would be a bit simpler and easier and clearer, no?
		Ah:
		https://hohenfeld.is/posts/creating-a-naive-bayes-spam-filter-in-r/
		https://www.r-bloggers.com/2021/04/naive-bayes-classification-in-r/
		https://www.enjoyalgorithms.com/blog/email-spam-and-non-spam-filtering-using-machine-learning
			good diagrams there - use those
		
		https://www.kaggle.com/datasets/uciml/sms-spam-collection-dataset
		https://spamassassin.apache.org/old/publiccorpus/
		https://archive.ics.uci.edu/ml/datasets/SMS+Spam+Collection
		
		Link to Paul Graham's articles:
			http://www.paulgraham.com/spam.html
			http://www.paulgraham.com/better.html
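
		For reference, a bare-bones sketch of the Naive Bayes classification idea: a multinomial model over whitespace-separated, lower-cased words with add-one smoothing. This is not the approach from any of the linked articles, and the training messages are made up:

```python
import math
from collections import Counter, defaultdict

def train_naive_bayes(examples):
    """examples: iterable of (text, label). Returns the counts needed to predict."""
    class_counts = Counter()
    word_counts = defaultdict(Counter)
    vocabulary = set()
    for text, label in examples:
        words = text.lower().split()
        class_counts[label] += 1
        word_counts[label].update(words)
        vocabulary.update(words)
    return class_counts, word_counts, vocabulary

def predict(model, text):
    """Return the label with the highest log posterior (add-one smoothing)."""
    class_counts, word_counts, vocabulary = model
    total_docs = sum(class_counts.values())
    scores = {}
    for label, doc_count in class_counts.items():
        total_words = sum(word_counts[label].values())
        score = math.log(doc_count / total_docs)  # log prior
        for word in text.lower().split():
            if word not in vocabulary:
                continue  # ignore words never seen in training
            count = word_counts[label][word] + 1
            score += math.log(count / (total_words + len(vocabulary)))
        scores[label] = score
    return max(scores, key=scores.get)

# Hypothetical training messages.
training = [
    ("win a free prize now", "spam"),
    ("free entry claim your prize", "spam"),
    ("are we still meeting for lunch", "ham"),
    ("see you at the lab later", "ham"),
]
model = train_naive_bayes(training)
print(predict(model, "claim your free prize"))
print(predict(model, "see you at lunch"))
```

		Working in log probabilities avoids underflow when many small probabilities are multiplied together.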


		Exercises (commented R script I think rather than an R Markdown file, for simplicity)
		
		 - Extract which characters are present in the messages
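
		   Possible shape for this exercise, shown in Python for illustration (the exercise script itself may well be in R). The messages are made up, and "unusual" is an arbitrary definition: anything that is not an ASCII letter, digit or whitespace:

```python
from collections import Counter

# Hypothetical messages; \u00a3 is the pound sign.
messages = [
    "FREE entry!! Txt WIN to 80082",
    "ok c u l8r :)",
    "Call now \u00a31000 prize",
]

# Count every character across all messages.
char_counts = Counter("".join(messages))

# Characters that are not plain ASCII letters, digits or whitespace.
unusual = sorted(
    c for c in char_counts
    if not (c.isascii() and (c.isalnum() or c.isspace()))
)

print(char_counts.most_common(5))
print(unusual)
```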
		 
		[ ] Questions:
		
		 - How to tokenise? Whether to use word stems or just leave as-is? What about case-sensitivity?
		 
			Maybe have them try different preprocessing and see whether it actually helps. Things like SHOUTING are often red flags for spam, and casual SMS use often doesn't involve conventional capitalisation, punctuation, spelling.
		 
		 - What limitations might there be in training a general-purpose spam-filtering e-mail classifier from a public dataset? Or, what advantages might there be to training your own classifier on your own messages?
		 
		 - The SMS data set is very simple, containing only the message and the class (ham/spam). What other information (metadata) would normally be available with SMS text messages? Would there be any advantage to including those in the classifier as well?
		 
			timestamp of message
			phone number of sender
			phone number of recipient
		
		 - Imagine you wanted to classify e-mail messages instead of SMS messages. What further metadata would generally be available for e-mails? What preprocessing might be necessary/appropriate?
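
		For the tokenisation question above, a small sketch students could vary to compare preprocessing choices. The function names and options are hypothetical; stemming is left out, and the upper-case ratio is just one candidate way to keep the SHOUTING signal when lower-casing:

```python
import re

def tokenise(text, lowercase=True, strip_punctuation=True):
    """Split a message into tokens with optional preprocessing."""
    if lowercase:
        text = text.lower()
    if strip_punctuation:
        # Keep runs of ASCII letters, digits and apostrophes only.
        return re.findall(r"[A-Za-z0-9']+", text)
    return text.split()

def shouting_ratio(text):
    """Fraction of letters that are upper case."""
    letters = [c for c in text if c.isalpha()]
    if not letters:
        return 0.0
    return sum(c.isupper() for c in letters) / len(letters)

message = "URGENT! You have WON a prize, call NOW"
print(tokenise(message))
print(tokenise(message, lowercase=False, strip_punctuation=False))
print(round(shouting_ratio(message), 2))
```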


--

Software Engineering and Information Assurance:

	Risk assessment modelling/calculations
	
	Threat models
	
	Code reviews
	
	Source code management
	
		Git, GitHub, GitBucket (do we need them to create user accounts?)
		
		Issue-tracking
		
		git blame, git bisect?