There was a mathematically unsatisfactory bit in the last post about measuring the relationships among mentions of color in The Lord of the Rings. When I used the Euclidean distance between the 62-dimensional vectors to calculate the relationship between color mentions, the dendrogram had some connections in it that don’t make much sense visually or textually, e.g., brown was clustered with black and red. The connections with a linear “Manhattan” distance measure made much more sense.  I asked Digital Tolkien about it, as one does, and he assured me that the L1 metric was better. But why?

It turns out this is something that mathematicians know: in high-dimensional spaces, using the Pythagorean Theorem causes near neighbors and far-away neighbors to be all about the same distance apart! 1 In fact, the choice of which of your neighbors is nearest isn’t even stable. The unavoidable numerical errors that come from using digital computers can dominate the real differences in the input data.2

Effect measured by L1 distance is more detectable at high dimension

Relative effect as a function of measure dimension

Of course, now that I’ve read a couple of papers about it, it’s obvious. Simplest possible case: suppose a book mentions one word once per chapter, and another word twice in one chapter and once in all the others. The relative difference between those two vectors, as a function of the number of chapters, looks like this.

62 dimensions counts as “high-dimensional”. Both ways of measuring distance have dropped a lot from our 3-dimensional experience, but the effect in our test case is twice as easy to compare when we use the Manhattan distance measure.


Notes

Notes

  1. Aggarwal, Charu C., Alexander Hinneburg, and Daniel A. Keim. “On the surprising behavior of distance metrics in high dimensional space.” International conference on database theory. Springer, Berlin, Heidelberg, 2001.
  2. Chris R. Giannella. “Instability results for Euclidean distance, nearest neighbor search on high dimensional Gaussian data”. Information Processing Letters 169, 2021, https://doi.org/10.1016/j.ipl.2021.106115