Something I had considered recently was “how much is this sentence like another?” I figured it would be an interesting exploration at the intersection of natural language processing, statistics, and information theory. It certainly motivates the application of theory and data gathering. To be sure, there are many levels of rigor in the approaches one may develop, ranging from heuristic to principled.
With that in mind, we consider a few problems to solve at these different levels:
1) measurement of similarity
2) space through which things are measured
To point 1, a mathematical measure is an analytic construction that satisfies specific properties.
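For reference, the specific properties are the standard measure axioms: a measure is a set function $\mu : \Sigma \to [0, \infty]$ on a $\sigma$-algebra $\Sigma$ over a set $X$ satisfying

```latex
% Measure axioms for \mu : \Sigma \to [0, \infty]
\mu(A) \ge 0 \quad \text{for all } A \in \Sigma
  \quad \text{(non-negativity)}

\mu(\emptyset) = 0
  \quad \text{(null empty set)}

\mu\!\left(\bigcup_{i=1}^{\infty} A_i\right) = \sum_{i=1}^{\infty} \mu(A_i)
  \quad \text{for pairwise disjoint } A_i \in \Sigma
  \quad \text{(countable additivity)}
```

Whether similarity is better captured by a measure in this sense or by a related construction (such as a metric) is part of the table setting deferred to later.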
If we want something a little less abstract and more immediately accessible, albeit at the cost of rigorous guarantees, we might design some heuristic approximation of the former.
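As one illustrative heuristic (my example here, not a method the later sections commit to), consider the Jaccard index over token sets: the fraction of distinct words two sentences share. It is crude, since it ignores word order and meaning entirely, but it is immediately computable:

```python
def jaccard_similarity(a: str, b: str) -> float:
    """Heuristic sentence similarity: Jaccard index over lowercase token sets."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta and not tb:
        return 1.0  # two empty sentences are trivially identical
    # |intersection| / |union| lies in [0, 1]
    return len(ta & tb) / len(ta | tb)

print(jaccard_similarity("the cat sat on the mat",
                         "the cat sat on a rug"))  # → 4/7 ≈ 0.571
```

Note that this heuristic already hints at the tension above: it is symmetric and bounded, but it gives no principled account of *why* two sentences are similar.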
Toward point 2, an embedding space is a popular approach. More specifically, it is a learned representation of semantic relationships, in which each dimension of an embedding vector corresponds to a learned feature. These relationships are, of course, characterized by the engineer, specifically through the objective, the data, and the training regime.
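Once sentences live in such a space, a common choice is to compare their embedding vectors by cosine similarity, the cosine of the angle between them. A minimal sketch, with hand-invented toy vectors standing in for the output of a real embedding model:

```python
import numpy as np

def cosine_similarity(u: np.ndarray, v: np.ndarray) -> float:
    """Cosine of the angle between two vectors: 1 = same direction, 0 = orthogonal."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# Toy 4-dimensional "embeddings", invented for illustration; a trained model
# would produce these from the sentences themselves.
cat    = np.array([0.9, 0.1, 0.0, 0.2])
kitten = np.array([0.8, 0.2, 0.1, 0.3])
car    = np.array([0.0, 0.9, 0.8, 0.1])

print(cosine_similarity(cat, kitten))  # high: nearby directions
print(cosine_similarity(cat, car))     # low: nearly orthogonal
```

The design choice worth noticing is that cosine similarity compares directions rather than magnitudes, which is why it pairs naturally with learned embeddings.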
Learning the embedding space is, intuitively, letting the data themselves describe the most salient dimensions along which they relate to one another. Another task to be explored later is how the dimension of the projection space relates to the interpretability of the vectors’ corresponding semantics, i.e. what does each dimension in a vector describe, approximately? There is a prevailing hypothesis within machine learning, termed the data manifold hypothesis: the idea that high-dimensional natural data concentrate near some lower-dimensional representation in which the data still relate to each other without loss of information.
The measure theory approach requires some table setting.