**Aleatha Parker-Wood, PhD -- Humu** Keynote (formerly Symantec; holds many security-related patents)
BYOD (bring your own device) - bigger attack surface
DLP (data loss prevention) - need more data, but that expands your attack surface
Need strict ACLs - have to avoid letting marketing use data intended only for security models
Encryption is not a magic bullet
"You can't store enough data to protect all your data"
Have to worry about model inversion attacks
"privacy for all users... except the bad ones"
Incremental learning, e.g. Bloom filter
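A minimal Bloom filter sketch to make the incremental-learning point concrete: set membership with fixed memory, items added one at a time, never stored. This is an illustration, not any particular production implementation; the sizes and domain names are made up.

```python
import hashlib

class BloomFilter:
    """Minimal Bloom filter: fixed-size bit array, k hash positions per item.
    Supports incremental adds; lookups may false-positive, never false-negative."""

    def __init__(self, m_bits=1024, k_hashes=3):
        self.m = m_bits
        self.k = k_hashes
        self.bits = bytearray(m_bits // 8)  # memory stays fixed forever

    def _positions(self, item):
        # Derive k bit positions from salted SHA-256 digests.
        for i in range(self.k):
            digest = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.m

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))

bf = BloomFilter()
for domain in ["evil.example", "malware.test"]:  # hypothetical blocklist entries
    bf.add(domain)
print("evil.example" in bf)  # True: added items always report present
```

The security-relevant property: the filter summarizes an unbounded stream in 128 bytes here, so you never have to retain (or protect) the raw items.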
Sketching - learning summary statistics as you go, e.g. rolling averages, HyperLogLog. Robust against threats, low space needs.
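As a tiny example of the sketching idea, a rolling average kept in O(1) memory (an incremental-mean update; my illustration, not from the talk):

```python
class RollingMean:
    """Streaming mean: constant memory regardless of stream length."""

    def __init__(self):
        self.n = 0
        self.mean = 0.0

    def update(self, x):
        # Incremental update: new_mean = old_mean + (x - old_mean) / n
        self.n += 1
        self.mean += (x - self.mean) / self.n

stat = RollingMean()
for value in [4.0, 8.0, 6.0, 2.0]:  # pretend these arrive one at a time
    stat.update(value)
print(stat.mean)  # 5.0
```

The same update pattern generalizes to variances (Welford) and, with more machinery, to cardinality sketches like HyperLogLog.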
Online learning, e.g. Hoeffding trees (an online decision tree), or stochastic gradient descent on mini-batches. Harder to do model selection because "you can't go back to the original data" (or you have to store some).
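A sketch of mini-batch SGD in the online setting, fitting a toy linear model in one pass and discarding each batch after the update (illustrative; the model, learning rate, and data are my assumptions):

```python
import random

def sgd_linear(stream, lr=0.01, batch=4):
    """Fit y ~ w*x + b with mini-batch SGD over a single pass:
    raw data is never retained, matching the online setting."""
    w, b = 0.0, 0.0
    buf = []
    for x, y in stream:
        buf.append((x, y))
        if len(buf) == batch:
            # Mean gradient of squared error over the batch.
            gw = sum(2 * (w * x + b - y) * x for x, y in buf) / batch
            gb = sum(2 * (w * x + b - y) for x, y in buf) / batch
            w -= lr * gw
            b -= lr * gb
            buf.clear()  # discard the batch: no going back to the data
    return w, b

random.seed(0)
xs = [random.uniform(-1, 1) for _ in range(4000)]
data = [(x, 3.0 * x + 1.0) for x in xs]  # true model: y = 3x + 1
w, b = sgd_linear(data)
print(round(w, 1), round(b, 1))
```

The model-selection pain from the talk shows up here: once `buf.clear()` runs, you cannot re-evaluate a different hyperparameter choice on that data.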
Data poisoning attacks - big area of research, see IEEE Security & Privacy
Differential Privacy - ref Dwork 2006 (not available for free)
Carefully calibrated noise to cover user identity: add a randomization factor (unbiased error) controlled by epsilon, a "probability correction"; some schemes also add a delta. In theory it sounds great, but she says in practice, not so much.
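The standard mechanism behind this is Laplace noise scaled to the query's sensitivity over epsilon. A minimal sketch (stdlib only, sampling Laplace via the inverse CDF; the count value and epsilon are made up):

```python
import math
import random

def laplace_noise(scale, rng):
    """Sample Laplace(0, scale) by inverse-CDF transform."""
    u = rng.random() - 0.5
    return -scale * math.copysign(math.log(1 - 2 * abs(u)), u)

def dp_count(true_count, epsilon, rng):
    """A counting query has sensitivity 1, so the noise scale is 1/epsilon.
    Smaller epsilon = more noise = stronger privacy, worse accuracy."""
    return true_count + laplace_noise(1.0 / epsilon, rng)

rng = random.Random(42)
answers = [dp_count(100, epsilon=0.5, rng=rng) for _ in range(5000)]
avg = sum(answers) / len(answers)
print(avg)  # near 100: the noise is unbiased, so answers average out
```

Each released answer is off by Laplace noise, but the error is unbiased, which is the "probability correction" framing in the notes.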
Ex.: Palpatine is CEO, Padme is Eng, Anakin is Sales. Differencing attack: Anakin didn't respond to the survey, so Palpatine can easily identify Padme's answers.
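The differencing attack is just subtraction. A toy version of the survey scenario (the answer values are made up for illustration):

```python
# Anakin did not respond, so only two answers are in the aggregate.
# Palpatine sees only the released total, but he knows his own answer.
survey = {"Palpatine": 1, "Padme": 0}  # 1 = yes, 0 = no (hypothetical)

released_total = sum(survey.values())  # the only number published
palpatine_own = survey["Palpatine"]

# Subtracting his own answer from the aggregate reveals Padme's exactly.
padme_answer = released_total - palpatine_own
print(padme_answer)  # 0
```

No individual record was ever released, yet one respondent's answer is fully recovered; this is the failure mode differential privacy's noise is meant to block.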
Solution: use a "high water mark" - tl;dr: over-report just a little, all the time (with fake data, I guess?)
2020 Census will use differential privacy; Apple uses it for predictive text
Advantages: prevents overfitting (Abadi 2016); protects against breaches, poisoning, and insider attacks.
Disadvantages: suppresses outliers (so not great for anomaly detection or security investigations); choosing epsilon is "a black art"; requires more data because it's noisy by design.
Private multi-party ML - distributed learning across mutually distrusting systems
ex. predictive text where data stays local to the phone, great for GDPR
Different kinds of solutions:
- Secure differential privacy - fast, but accuracy issues
- Secure multiparty computation (SMC) - high I/O overhead (due to the multi-round communication required), but higher accuracy
- Homomorphic encryption - special case of SMC -- never decrypt
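To give a feel for the SMC idea, here is a toy additive secret-sharing sum: each party splits its private value into random shares, and only the shares are exchanged, so the group learns the total without anyone seeing an individual input. This is a teaching sketch (party names and values invented), not a real protocol with networking or malicious-party defenses:

```python
import random

def share(value, n_parties, modulus, rng):
    """Split value into n additive shares mod modulus; any n-1 shares
    are uniformly random, so no subset short of all n learns the input."""
    shares = [rng.randrange(modulus) for _ in range(n_parties - 1)]
    shares.append((value - sum(shares)) % modulus)
    return shares

rng = random.Random(7)
MOD = 2**31
inputs = {"party_a": 120, "party_b": 75, "party_c": 240}  # private counts

# Each party shards its input; parties exchange share columns, not raw values.
all_shares = [share(v, 3, MOD, rng) for v in inputs.values()]
partial_sums = [sum(col) % MOD for col in zip(*all_shares)]
joint_total = sum(partial_sums) % MOD
print(joint_total)  # 435, with no raw input ever revealed
```

The multi-round share exchange is exactly where the I/O overhead noted above comes from; real protocols repeat this for every operation in the computation.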
Truex et al. (paper coming out soon) - hybrid of differential privacy + homomorphic encryption
Papernot et al., PATE - Private Aggregation of Teacher Ensembles. Learn locally on private data, predict on public data. Accuracy and efficiency are challenges, plus problems on the mathematical side, e.g. encryption.
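The aggregation step in PATE is a noisy argmax over teacher votes: add Laplace noise to each label's vote count, release only the winning label. A minimal sketch of that one step (teacher count, labels, and epsilon are invented; the real mechanism has more moving parts, e.g. a privacy accountant):

```python
import math
import random
from collections import Counter

def noisy_argmax(votes, epsilon, rng):
    """PATE-style aggregation: perturb each label's vote count with
    Laplace noise, then release only the argmax label."""
    counts = Counter(votes)

    def noise():
        u = rng.random() - 0.5
        return -(1.0 / epsilon) * math.copysign(math.log(1 - 2 * abs(u)), u)

    return max(counts, key=lambda label: counts[label] + noise())

rng = random.Random(1)
# 50 hypothetical teachers, each trained on a disjoint private partition.
teacher_votes = ["malware"] * 40 + ["benign"] * 10
label = noisy_argmax(teacher_votes, epsilon=1.0, rng=rng)
print(label)
```

With strong teacher consensus, the noise rarely flips the answer, which is why PATE can be accurate while each private partition stays protected.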
@aleatha aleatha@humu.com
**Felipe Ducau -- Sophos** Describing Malware via Tagging (at Sophos 2.5 years)
Tokenize information about malware - what type it is, whether it's compressed, etc.
Deep learning approaches - binary entropy model. Trained a model on 1 year of data (10M or 76M samples?), validation set 3M, test set 3.8M. Ended up with 20 features - ref. Saxe and Berlin 2015.
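To illustrate the kind of entropy feature these models consume (my sketch of the general idea, not Sophos's or Saxe & Berlin's actual feature code): Shannon entropy over a byte window, where high values suggest packed or encrypted content.

```python
import math
from collections import Counter

def byte_entropy(window: bytes) -> float:
    """Shannon entropy of a byte window, in bits per byte.
    0 = constant data; 8 = all 256 byte values equally likely."""
    counts = Counter(window)
    total = len(window)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

print(byte_entropy(bytes(range(256))))  # 8.0: maximally mixed bytes
print(byte_entropy(b"\x00" * 256))      # constant data, zero entropy
```

Sliding this over a binary gives a per-region entropy profile without executing or unpacking the file.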
Multi-head neural nets get 96% coverage - explode to get more tokens, then collapse back down.
Joint embedding was actually better and more interpretable; the idea comes from computer vision. Use the dot product as a distance to get relationships between tags. Mean TPR (true positive rate) of 0.88, overall 0.71; AUC 0.99. Pre-print is available on arXiv: https://arxiv.org/abs/1905.06262
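A toy version of the dot-product-as-distance idea: tags whose learned vectors point the same way score high against each other. The 3-d vectors and tag names below are entirely made up for illustration; real embeddings are learned jointly with the sample embeddings.

```python
# Hypothetical tag embeddings (real ones are learned, higher-dimensional).
tags = {
    "ransomware": [0.9, 0.1, 0.2],
    "crypto":     [0.8, 0.2, 0.3],
    "adware":     [0.1, 0.9, 0.1],
}

def dot(u, v):
    """Dot product: larger = tags that co-occur on similar samples."""
    return sum(a * b for a, b in zip(u, v))

print(round(dot(tags["ransomware"], tags["crypto"]), 2))  # high: related tags
print(round(dot(tags["ransomware"], tags["adware"]), 2))  # low: unrelated tags
```

Ranking tag pairs by this score is what makes the joint-embedding model more interpretable than independent per-tag heads.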
ALOHA - auxiliary loss optimization for hypothesis augmentation, ref. Rudd et al. 2019. Video is here: https://www.usenix.org/conference/usenixsecurity19/presentation/rudd
Can use this approach to cluster threats and prioritize; easy to inspect via t-SNE. Tested with known positive controls. They don't have to do any unpacking of the binary with this approach. Someone asked why not do topic modeling; he says they tried that and didn't pursue it, but someone should.
**Laura Dedic -- Novetta** CNN-based malware visualization and explainability