In 2019, when I first learned about neural networks, I tested their robustness by adding random noise to the labels. To my surprise, if the dataset is big enough, even with 80% random label error the network could still reach 95% accuracy. [doc] Recently I watched a video from Geoffrey Hinton in which he explains why this happens.
In the video, Geoffrey points out that “the rule of thumb is basically what counts is the mutual information between the assigned label and the truth. That tells how valuable your training example is.”
Geoffrey then asks: if each label carries only 1/50 of the mutual information, can we get the same performance with 50 times as many examples? Roughly yes — his answer is that the training set needs to be about 2×50 times as large.
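To make that rule of thumb concrete, here is a small numpy sketch that measures the mutual information between noisy labels and the truth. The synthetic binary labels, the 80% noise model, and the `mutual_information` helper are all my own assumptions for illustration, not anything from the video.

```python
import numpy as np

rng = np.random.default_rng(0)

def mutual_information(x, y):
    """Empirical mutual information (in bits) between two discrete label arrays."""
    joint = np.zeros((x.max() + 1, y.max() + 1))
    for a, b in zip(x, y):
        joint[a, b] += 1
    joint /= joint.sum()
    px = joint.sum(axis=1, keepdims=True)   # marginal of x
    py = joint.sum(axis=0, keepdims=True)   # marginal of y
    mask = joint > 0
    return float((joint[mask] * np.log2(joint[mask] / (px @ py)[mask])).sum())

n = 100_000
truth = rng.integers(0, 2, size=n)           # true binary labels
noisy_mask = rng.random(n) < 0.8             # 80% of labels get replaced...
random_labels = rng.integers(0, 2, size=n)   # ...by uniformly random labels
assigned = np.where(noisy_mask, random_labels, truth)

mi_clean = mutual_information(truth, truth)      # close to 1 bit for balanced labels
mi_noisy = mutual_information(assigned, truth)   # only a few percent of a bit survives
print(f"clean labels: {mi_clean:.3f} bits; 80%-noisy labels: {mi_noisy:.3f} bits")
```

With 80% of labels replaced at random, an assigned label still agrees with the truth 60% of the time, but its mutual information with the truth drops to a few hundredths of a bit — in the same ballpark as the 1/50 figure in the quote.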
Let me run a test on our dataset — GIGO (garbage in, garbage out) may not hold for neural nets.
In the future, I would like to find research papers that explain this in more detail.