For my research at UCL, I ran an analysis of instant-messaging (IM) logs in English and Chinese. The statistical analysis treated each Mandarin character/pictogram as one character, so that the counts could be compared with their English equivalents.
This disturbed me then and disturbs me still, because each Mandarin ideograph can in fact contain more than one pictogram.
So how accurate and precise are results obtained this way, given that there is no one-to-one mapping between Mandarin characters and English words?
Perhaps I need to check each set of ideograms before matching them to an English word.
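
To make the concern concrete, here is a minimal Python sketch of how the counts shift depending on the unit chosen. It is only an illustration, not the method used in the study: the sample sentences, the hand-made word segmentation of the Chinese sentence, and the helper names (`is_han`, `count_han_characters`, `count_english_words`, `count_english_letters`) are all assumptions introduced here, and the Han check only covers two common Unicode blocks.

```python
import re

def is_han(ch: str) -> bool:
    """Rough check: CJK Unified Ideographs plus Extension A only (an assumption;
    rarer characters in other blocks are not covered)."""
    cp = ord(ch)
    return 0x4E00 <= cp <= 0x9FFF or 0x3400 <= cp <= 0x4DBF

def count_han_characters(text: str) -> int:
    """Count every Han character as one unit, as the analysis above did."""
    return sum(1 for ch in text if is_han(ch))

def count_english_words(text: str) -> int:
    """Count runs of ASCII letters (with an optional apostrophe part) as words."""
    return len(re.findall(r"[A-Za-z]+(?:'[A-Za-z]+)?", text))

def count_english_letters(text: str) -> int:
    """Count individual ASCII letters."""
    return sum(1 for ch in text if ch.isascii() and ch.isalpha())

# Hypothetical sample pair, chosen only for illustration.
chinese = "我喜欢学习"
english = "I like studying"

# Hand-made segmentation into words (an assumption; a real study would need
# a word segmenter or manual annotation).
chinese_words = ["我", "喜欢", "学习"]

counts = {
    "han_characters": count_han_characters(chinese),
    "chinese_words": len(chinese_words),
    "english_words": count_english_words(english),
    "english_letters": count_english_letters(english),
}
print(counts)
```

For this one pair, the character-to-word comparison gives 5 against 3, while the word-to-word comparison gives 3 against 3, so the choice of unit alone changes the ratio being analysed.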