**Aleatha Parker-Wood, PhD -- Humu** Keynote (formerly Symantec; holds many security-related patents)
BYOD (bring your own device) - bigger attack surface
DLP (data loss prevention) - need more data, but that expands your attack surface
Need strict ACLs - have to avoid letting marketing use data intended only for security models
Encryption is not a magic bullet
"You can't store enough data to protect all your data"
Have to worry about model inversion attacks
"privacy for all users... except the bad ones"
Incremental learning, e.g. Bloom filter
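A minimal Bloom filter sketch to make the incremental-learning point concrete: set membership with fixed memory, items added one at a time, never stored. This is an illustration, not any particular production implementation; the sizes and domain names are made up.

```python
import hashlib

class BloomFilter:
    """Minimal Bloom filter: fixed-size bit array, k hash positions per item.
    Supports incremental adds; lookups may false-positive, never false-negative."""

    def __init__(self, m_bits=1024, k_hashes=3):
        self.m = m_bits
        self.k = k_hashes
        self.bits = bytearray(m_bits // 8)  # memory stays fixed forever

    def _positions(self, item):
        # Derive k bit positions from salted SHA-256 digests.
        for i in range(self.k):
            digest = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.m

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))

bf = BloomFilter()
for domain in ["evil.example", "malware.test"]:  # hypothetical blocklist entries
    bf.add(domain)
print("evil.example" in bf)  # True: added items always report present
```

The security-relevant property: the filter summarizes an unbounded stream in 128 bytes here, so you never have to retain (or protect) the raw items.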
Sketching - learning summary statistics as you go, e.g. rolling averages, HyperLogLog. Robust against threats, low space needs.
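As a tiny example of the sketching idea, a rolling average kept in O(1) memory (an incremental-mean update; my illustration, not from the talk):

```python
class RollingMean:
    """Streaming mean: constant memory regardless of stream length."""

    def __init__(self):
        self.n = 0
        self.mean = 0.0

    def update(self, x):
        # Incremental update: new_mean = old_mean + (x - old_mean) / n
        self.n += 1
        self.mean += (x - self.mean) / self.n

stat = RollingMean()
for value in [4.0, 8.0, 6.0, 2.0]:  # pretend these arrive one at a time
    stat.update(value)
print(stat.mean)  # 5.0
```

The same update pattern generalizes to variances (Welford) and, with more machinery, to cardinality sketches like HyperLogLog.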
Online learning, e.g. Hoeffding trees (an online decision tree), or stochastic gradient descent on mini-batches. Harder to do model selection because "you can't go back to the original data" (or you have to store some).
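A sketch of mini-batch SGD in the online setting, fitting a toy linear model in one pass and discarding each batch after the update (illustrative; the model, learning rate, and data are my assumptions):

```python
import random

def sgd_linear(stream, lr=0.01, batch=4):
    """Fit y ~ w*x + b with mini-batch SGD over a single pass:
    raw data is never retained, matching the online setting."""
    w, b = 0.0, 0.0
    buf = []
    for x, y in stream:
        buf.append((x, y))
        if len(buf) == batch:
            # Mean gradient of squared error over the batch.
            gw = sum(2 * (w * x + b - y) * x for x, y in buf) / batch
            gb = sum(2 * (w * x + b - y) for x, y in buf) / batch
            w -= lr * gw
            b -= lr * gb
            buf.clear()  # discard the batch: no going back to the data
    return w, b

random.seed(0)
xs = [random.uniform(-1, 1) for _ in range(4000)]
data = [(x, 3.0 * x + 1.0) for x in xs]  # true model: y = 3x + 1
w, b = sgd_linear(data)
print(round(w, 1), round(b, 1))
```

The model-selection pain from the talk shows up here: once `buf.clear()` runs, you cannot re-evaluate a different hyperparameter choice on that data.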
Data poisoning attacks - big area of research, see IEEE Security & Privacy
Differential Privacy - ref Dwork 2006 (not available for free)
Carefully calibrated noise to cover user identity: add a randomization factor (unbiased error) controlled by epsilon, a "probability correction"; some schemes also add a delta. In theory it sounds great, but she says in practice, not so much.
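The standard mechanism behind this is Laplace noise scaled to the query's sensitivity over epsilon. A minimal sketch (stdlib only, sampling Laplace via the inverse CDF; the count value and epsilon are made up):

```python
import math
import random

def laplace_noise(scale, rng):
    """Sample Laplace(0, scale) by inverse-CDF transform."""
    u = rng.random() - 0.5
    return -scale * math.copysign(math.log(1 - 2 * abs(u)), u)

def dp_count(true_count, epsilon, rng):
    """A counting query has sensitivity 1, so the noise scale is 1/epsilon.
    Smaller epsilon = more noise = stronger privacy, worse accuracy."""
    return true_count + laplace_noise(1.0 / epsilon, rng)

rng = random.Random(42)
answers = [dp_count(100, epsilon=0.5, rng=rng) for _ in range(5000)]
avg = sum(answers) / len(answers)
print(avg)  # near 100: the noise is unbiased, so answers average out
```

Each released answer is off by Laplace noise, but the error is unbiased, which is the "probability correction" framing in the notes.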
Ex.: Palpatine is CEO, Padme is Eng, Anakin is Sales. Differencing attack: Anakin didn't respond to the survey, so Palpatine can easily identify Padme's answers.
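The differencing attack is just subtraction. A toy version of the survey scenario (the answer values are made up for illustration):

```python
# Anakin did not respond, so only two answers are in the aggregate.
# Palpatine sees only the released total, but he knows his own answer.
survey = {"Palpatine": 1, "Padme": 0}  # 1 = yes, 0 = no (hypothetical)

released_total = sum(survey.values())  # the only number published
palpatine_own = survey["Palpatine"]

# Subtracting his own answer from the aggregate reveals Padme's exactly.
padme_answer = released_total - palpatine_own
print(padme_answer)  # 0
```

No individual record was ever released, yet one respondent's answer is fully recovered; this is the failure mode differential privacy's noise is meant to block.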
Solution: use a "high water mark" - tl;dr: over-report just a little, all the time (with fake data, I guess?)
2020 Census will use differential privacy; Apple uses it for predictive text
Advantages: prevents overfitting (Abadi 2016); protects against breaches, poisoning, and insider attacks.
Disadvantages: suppresses outliers (so not great for anomaly detection or security investigations); choosing epsilon is "a black art"; requires more data because it's noisy by design.
Private multi-party ML - distributed learning across mutually distrusting systems
ex. predictive text where data stays local to the phone, great for GDPR
Different kinds of solutions:
- Secure differential privacy - fast, but accuracy issues
- Secure multiparty computation (SMC) - high I/O overhead (due to the multi-round communication required), but higher accuracy
- Homomorphic encryption - special case of SMC -- never decrypt
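To give a feel for the SMC idea, here is a toy additive secret-sharing sum: each party splits its private value into random shares, and only the shares are exchanged, so the group learns the total without anyone seeing an individual input. This is a teaching sketch (party names and values invented), not a real protocol with networking or malicious-party defenses:

```python
import random

def share(value, n_parties, modulus, rng):
    """Split value into n additive shares mod modulus; any n-1 shares
    are uniformly random, so no subset short of all n learns the input."""
    shares = [rng.randrange(modulus) for _ in range(n_parties - 1)]
    shares.append((value - sum(shares)) % modulus)
    return shares

rng = random.Random(7)
MOD = 2**31
inputs = {"party_a": 120, "party_b": 75, "party_c": 240}  # private counts

# Each party shards its input; parties exchange share columns, not raw values.
all_shares = [share(v, 3, MOD, rng) for v in inputs.values()]
partial_sums = [sum(col) % MOD for col in zip(*all_shares)]
joint_total = sum(partial_sums) % MOD
print(joint_total)  # 435, with no raw input ever revealed
```

The multi-round share exchange is exactly where the I/O overhead noted above comes from; real protocols repeat this for every operation in the computation.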
Truex et al. (paper coming out soon) - hybrid of differential privacy + homomorphic encryption
Papernot et al., PATE - Private Aggregation of Teacher Ensembles. Learn locally on private data, predict on public data. Accuracy and efficiency are challenges, plus problems on the mathematical side, e.g. encryption.
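The aggregation step in PATE is a noisy argmax over teacher votes: add Laplace noise to each label's vote count, release only the winning label. A minimal sketch of that one step (teacher count, labels, and epsilon are invented; the real mechanism has more moving parts, e.g. a privacy accountant):

```python
import math
import random
from collections import Counter

def noisy_argmax(votes, epsilon, rng):
    """PATE-style aggregation: perturb each label's vote count with
    Laplace noise, then release only the argmax label."""
    counts = Counter(votes)

    def noise():
        u = rng.random() - 0.5
        return -(1.0 / epsilon) * math.copysign(math.log(1 - 2 * abs(u)), u)

    return max(counts, key=lambda label: counts[label] + noise())

rng = random.Random(1)
# 50 hypothetical teachers, each trained on a disjoint private partition.
teacher_votes = ["malware"] * 40 + ["benign"] * 10
label = noisy_argmax(teacher_votes, epsilon=1.0, rng=rng)
print(label)
```

With strong teacher consensus, the noise rarely flips the answer, which is why PATE can be accurate while each private partition stays protected.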
@aleatha aleatha@humu.com
**Felipe Ducau -- Sophos** Describing Malware via Tagging (at Sophos 2.5 years)
Tokenize information about malware - what type it is, whether it's compressed, etc.
Deep learning approaches - binary entropy model. Trained a model on 1 year of data (10M or 76M samples?), validation set 3M, test set 3.8M. Ended up with 20 features - ref. Saxe and Berlin 2015.
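To illustrate the kind of entropy feature these models consume (my sketch of the general idea, not Sophos's or Saxe & Berlin's actual feature code): Shannon entropy over a byte window, where high values suggest packed or encrypted content.

```python
import math
from collections import Counter

def byte_entropy(window: bytes) -> float:
    """Shannon entropy of a byte window, in bits per byte.
    0 = constant data; 8 = all 256 byte values equally likely."""
    counts = Counter(window)
    total = len(window)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

print(byte_entropy(bytes(range(256))))  # 8.0: maximally mixed bytes
print(byte_entropy(b"\x00" * 256))      # constant data, zero entropy
```

Sliding this over a binary gives a per-region entropy profile without executing or unpacking the file.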
Multi-head neural nets get 96% coverage - explode to get more tokens, then collapse back down.
Joint embedding was actually better and more interpretable; the idea comes from computer vision. Use the dot product as a distance to get relationships between tags. Mean TPR (true positive rate) of 0.88, overall 0.71; AUC 0.99. Pre-print is available on arXiv: https://arxiv.org/abs/1905.06262
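A toy version of the dot-product-as-distance idea: tags whose learned vectors point the same way score high against each other. The 3-d vectors and tag names below are entirely made up for illustration; real embeddings are learned jointly with the sample embeddings.

```python
# Hypothetical tag embeddings (real ones are learned, higher-dimensional).
tags = {
    "ransomware": [0.9, 0.1, 0.2],
    "crypto":     [0.8, 0.2, 0.3],
    "adware":     [0.1, 0.9, 0.1],
}

def dot(u, v):
    """Dot product: larger = tags that co-occur on similar samples."""
    return sum(a * b for a, b in zip(u, v))

print(round(dot(tags["ransomware"], tags["crypto"]), 2))  # high: related tags
print(round(dot(tags["ransomware"], tags["adware"]), 2))  # low: unrelated tags
```

Ranking tag pairs by this score is what makes the joint-embedding model more interpretable than independent per-tag heads.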
ALOHA - auxiliary loss optimization for hypothesis augmentation, ref. Rudd et al. 2019. Video is here: https://www.usenix.org/conference/usenixsecurity19/presentation/rudd
Can use this approach to cluster threats and prioritize; easy to inspect via t-SNE. Tested with known positive controls. They don't have to do any unpacking of the binary with this approach. Someone asked why not do topic modeling; he says they tried that and didn't pursue it, but someone should.
**Laura Dedic -- Novetta** CNN-based malware visualization and explainability