COMP 210 Labs: New for 2022

Machine Learning and Information Assurance:

R or Python? Both are available in the labenv. R may be a bit more friendly/simple for beginners.

Loading and understanding a data set
Exploratory Data Analysis
    Performing basic descriptive statistics
        Typical values: mean, median, mode
        Standard deviation
    Basic plotting (maybe provide some helper functions for common cases)
    Correlation matrix
    Missing values? Outliers?
General machine learning background

Wireshark analysis?
    https://community.rti.com/static/documentation/wireshark/2020-07/doc/examples.html
    stateful analysis?
    http://kdd.ics.uci.edu/databases/kddcup99/kddcup99.html

[ ] Diagram: training and testing division, general ML pipelines

POPFile?
    Maybe tricky: it's old, the official Web site is down/gone, and the proxy mode could be a pain to work with in the labenv (have to run a mail client). Also, training manually on e-mails could be a nuisance.

Instead, maybe do Naive Bayes classification within R
    Might be good to do that first.
    Introduce confusion matrix concept too.
    Have students consider the relative importance of incorrect classification, esp. "ham" as "spam"
    Could even use one of the SMS datasets - would be a bit simpler and easier and clearer, no?
    Ah:
        https://hohenfeld.is/posts/creating-a-naive-bayes-spam-filter-in-r/
        https://www.r-bloggers.com/2021/04/naive-bayes-classification-in-r/
        https://www.enjoyalgorithms.com/blog/email-spam-and-non-spam-filtering-using-machine-learning
            good diagrams there - use those
        https://www.kaggle.com/datasets/uciml/sms-spam-collection-dataset
        https://spamassassin.apache.org/old/publiccorpus/
        https://archive.ics.uci.edu/ml/datasets/SMS+Spam+Collection
    Link to Paul Graham's articles:
        http://www.paulgraham.com/spam.html
        http://www.paulgraham.com/better.html

Exercises (commented R script I think rather than an R Markdown file, for simplicity)
    - Extract what characters are present in the messages

[ ] Questions:
    - How to tokenise? Whether to use word stems or just leave as-is? What about case-sensitivity? Maybe have them try different preprocessing and see whether it actually helps. Things like SHOUTING are often red flags for spam, and casual SMS use often doesn't involve conventional capitalisation, punctuation, or spelling.
    - What limitations might there be in training a general-purpose spam-filtering e-mail classifier from a public dataset? Or, what advantages might there be to training your own classifier on your own messages?
    - The SMS data set is very simple, containing only the message and the class (ham/spam). What other information (metadata) would normally be available with SMS text messages? Would there be any advantage to including those in the classifier as well?
        timestamp of message
        phone number of sender
        phone number of recipient
    - Imagine you wanted to classify e-mail messages instead of SMS messages. What further metadata would generally be available for e-mails? What preprocessing might be necessary/appropriate?

--

Software Engineering and Information Assurance:

Risk assessment modelling/calculations
Threat models
Code reviews
Source code management
    Git, GitHub, GitBucket (do we need them to create user accounts?)
    Issue-tracking
    git blame, git bisect?
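
Possible starting point for the risk assessment calculations item (a sketch only, not a settled lab design). It assumes the classic annualised loss expectancy model, which these notes do not actually name: SLE = asset value x exposure factor, and ALE = SLE x ARO. The `Risk` class, the `rank_by_ale` helper, and all of the figures are hypothetical, made up for illustration. Written in Python here; an R version would be just as short.

```python
from dataclasses import dataclass


@dataclass
class Risk:
    """One row of a hypothetical risk register (illustrative only)."""
    name: str
    asset_value: float      # value of the asset at risk, in currency units
    exposure_factor: float  # fraction of the asset lost per incident (0 to 1)
    aro: float              # annualised rate of occurrence (incidents per year)

    @property
    def sle(self) -> float:
        """Single loss expectancy: expected loss from one incident."""
        return self.asset_value * self.exposure_factor

    @property
    def ale(self) -> float:
        """Annualised loss expectancy: expected loss per year."""
        return self.sle * self.aro


def rank_by_ale(risks):
    """Return the risks sorted from highest to lowest ALE."""
    return sorted(risks, key=lambda r: r.ale, reverse=True)


# Made-up figures, not taken from any real assessment.
risks = [
    Risk("Laptop theft", asset_value=2000, exposure_factor=1.0, aro=0.5),
    Risk("Ransomware on file server", asset_value=50000, exposure_factor=0.4, aro=0.1),
    Risk("Phishing credential loss", asset_value=10000, exposure_factor=0.25, aro=2),
]

for r in rank_by_ale(risks):
    print(f"{r.name}: SLE={r.sle:.2f}, ALE={r.ale:.2f}")
```

Could have students change one input at a time (e.g. halve the ARO for phishing after "training") and see how the ranking shifts. Worth discussing how rough the input estimates usually are.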