If you have an imbalanced dataset, the typical strategies for training a classifier are to oversample the minority class, or to modify the loss function so that misclassifications of the minority class are penalized more than misclassifications of the majority class. If you think about it, aren't these methods mathematically equivalent?
Based on experiments that I ran with synthetic data, these methods are not very effective either. Yet they are widely used. Am I missing something?
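To make the equivalence question concrete, here is a minimal sketch, assuming a summed log loss over the full dataset, oversampling by exact duplication, and an integer factor `k` (the labels, predictions, and `k` are made up for illustration). Under those assumptions, duplicating each minority example until it appears `k` times gives the same loss as weighting each minority example by `k`:

```python
import numpy as np

def log_loss_per_example(y, p):
    """Binary cross-entropy for each example (y in {0, 1}, p = predicted P(y=1))."""
    return -(y * np.log(p) + (1 - y) * np.log(1 - p))

rng = np.random.default_rng(0)
y = np.array([0] * 8 + [1] * 2)            # hypothetical imbalanced labels
p = rng.uniform(0.05, 0.95, size=y.size)   # hypothetical model predictions
k = 4                                      # assumed oversampling factor

# Strategy 1: weight each minority example k times as heavily in the loss.
weights = np.where(y == 1, k, 1)
weighted_loss = np.sum(weights * log_loss_per_example(y, p))

# Strategy 2: add k - 1 extra copies of each minority example.
minority = y == 1
y_over = np.concatenate([y, np.repeat(y[minority], k - 1)])
p_over = np.concatenate([p, np.repeat(p[minority], k - 1)])
oversampled_loss = np.sum(log_loss_per_example(y_over, p_over))

print(np.isclose(weighted_loss, oversampled_loss))
```

The exact match depends on those assumptions. With minibatch SGD, random oversampling, or synthetic oversampling, the two losses would agree at best in expectation, so the training dynamics can differ.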
David Tse recently gave a talk where he said that there is often an unexplored information-theoretic angle to many well-studied machine learning problems. For example, you can use the maximum entropy principle to fit a probability distribution based on samples from it, instead of using maximum likelihood.
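As a reminder of what the maximum entropy principle looks like in its standard textbook form (the feature functions \(f_k\) and multipliers \(\lambda_k\) are my notation, not something from the talk): given samples \(x_1, \dots, x_n\), you pick the distribution with the largest entropy among those that match the empirical averages of the chosen features,

\[
\max_{p} \; -\sum_{x} p(x) \log p(x)
\quad \text{subject to} \quad
\sum_{x} p(x)\, f_k(x) = \frac{1}{n} \sum_{i=1}^{n} f_k(x_i) \;\; \text{for all } k,
\qquad \sum_{x} p(x) = 1,
\]

and the solution takes the exponential-family form

\[
p(x) \propto \exp\Big( \sum_{k} \lambda_k f_k(x) \Big),
\]

with the \(\lambda_k\) chosen so that the constraints hold.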
That thought ran through my head as I was reading this new paper for journal club: the authors use mutual information as the criterion to determine which features of an image (or any input more generally) are the most responsible for a prediction that is made by a trained neural network: https://arxiv.org/pdf/1802.07814.pdf
When machine learning is used to build better black-box prediction models for biomedical questions, the expected useful lifetime of any one model is short. You can be sure that a bigger model, trained on more data or for more time, will come along and make your model obsolete.
What may be more useful in the long run is to develop techniques that use data to discover biological mechanisms. These discoveries will stay relevant even if the original models used to discover them are replaced by other techniques.
In the Node2Vec paper, the authors propose that an embedding for every node in a graph can be learned by trying to maximize the dot product between the feature representation of a node and the feature representations of its neighbors, where the neighborhood is determined by BFS (useful for structural equivalence) or DFS (useful for learning homophily, or finding nodes that are connected together).
I think that the strategy for homophily makes sense, but the BFS strategy doesn’t. Imagine that you have a simple graph with two nodes \(A\) and \(B\), each surrounded by 5 neighbors, and let’s assume that the neighbors are connected only to \(A\) or only to \(B\), respectively. In that case, if you wanted to learn structural equivalence, then \(A\) and \(B\) should have similar embeddings to each other, and the neighbors should all have similar embeddings to one another.
However, the objective function that the authors propose won’t achieve this. Instead, it will push \(A\) and its neighbors toward similar embeddings, and \(B\) and its neighbors toward similar embeddings. Could this be ameliorated by having a “left” and a “right” embedding vector for each node? Then the objective would be to maximize the dot product between the left vector of a node and the right vector of each of its neighbors.
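Here is a minimal sketch of what that two-vector objective could look like on the toy graph described above. It assumes a softmax over all nodes of the dot products, direct neighbors as the neighborhood, and randomly initialized vectors; the names `left`, `right`, and `objective` are hypothetical, and nothing here is the paper's actual implementation. The single-vector objective is the special case where both arguments are the same matrix:

```python
import numpy as np

# Toy graph from the text: hubs A and B, each with 5 leaves of its own.
A, B = 0, 1
edges = [(A, i) for i in range(2, 7)] + [(B, i) for i in range(7, 12)]
# Use each edge in both directions, so leaves also predict their hub.
pairs = edges + [(v, u) for u, v in edges]
n_nodes, dim = 12, 4

def log_softmax(scores):
    """Row-wise log softmax, shifted by the row maximum for numerical stability."""
    scores = scores - scores.max(axis=1, keepdims=True)
    return scores - np.log(np.exp(scores).sum(axis=1, keepdims=True))

def objective(left, right, pairs):
    """Sum over (node, neighbor) pairs of log softmax(left[node] . right[neighbor])."""
    logp = log_softmax(left @ right.T)  # shape (n_nodes, n_nodes)
    src, dst = zip(*pairs)
    return logp[list(src), list(dst)].sum()

rng = np.random.default_rng(0)
Z = rng.normal(size=(n_nodes, dim))  # one shared vector per node
L = rng.normal(size=(n_nodes, dim))  # hypothetical "left" vectors
R = rng.normal(size=(n_nodes, dim))  # hypothetical "right" vectors

single = objective(Z, Z, pairs)  # shared-vector objective
double = objective(L, R, pairs)  # left/right objective
print(single, double)
```

This only evaluates the two objectives at random vectors. Answering the question would mean maximizing each one (by gradient ascent, say) and then comparing the learned vectors of \(A\) and \(B\), and of the neighbors, which this sketch does not do.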