
【注:这是陈立功所著统计学专著《哲学之于统计学》的自序。本文学城博文采用了其最初的标题:一个人的孤独之旅】
一、统计理性、梦想和挫败 (The Statistical Rationality, the Dream, and the Frustrations)
在本作者看来,统计学的最高理性是如何认识基于随机性的非确定性,而随机性的本质则是人对认知对象的未知,因此,由随机性导致的非确定性实质是人的认识的非确定性。由于这种非确定性是统计认知的唯一特性,类比于对一个常量的最大和最小期望的同一性,以辩证的方式,我们还可以说,这一最高理性同时也是统计学的最低理性。这一最高和最低理性决定了统计理论和方法构建中的基本原则。当我们说其最高理性时,意味着每个人都必须遵循它;而当我们说其最低理性时,则任何人在任何时候都不能以任何方式违反它。因此,那些试图用单一的确定性假定来构建统计理论和方法的人便是违背了这一由随机性确立的统计理性,因为这类唯心主义的假定本身就是将那些客观上未知的非确定性转变成了心理上已知的确定性。
In the author’s opinion, the highest rationality of statistics lies in how to understand the uncertainty arising from randomness, and the essence of randomness is our lack of knowledge of the object of cognition. Therefore, the uncertainty caused by randomness is actually the uncertainty of human cognition itself. Since this uncertainty is the sole characteristic of statistical cognition, analogous to how the maximum and minimum expectations of a constant are identical, the highest rationality of statistics is simultaneously its lowest rationality in a dialectical sense. This highest-and-lowest rationality determines the basic principles for the construction of statistical theories and methods. When we call it the highest rationality, we mean that everyone must follow it; and when we call it the lowest rationality, we mean that no one may violate it in any way at any time. Therefore, those who attempt to build statistical theories and methods upon a single deterministic assumption violate this statistical rationality rooted in randomness, for such idealistic assumptions themselves convert objectively unknown uncertainties into psychologically known certainties.
自从前苏联数学家Андрей Николаевич Колмогоров(安德烈·尼古拉耶维奇·柯尔莫哥洛夫)在1933年用自己的最高智慧完成了对概率论的公理化后,无数数学背景的统计学家们倾力所为的目标就是试图用数学的逻辑框架和形式化语言将统计学打造成一个严谨的数学分支,或者一门类数学的学科。他们的艰辛努力在很大程度上促进了近当代统计学的长足进步和发展。但是,由于数学思维模式和逻辑框架的先天不足,也在该领域刻下了许多幼稚、甚至错误的烙印。
Since the former Soviet mathematician Андрей Николаевич Колмогоров (Andrey Nikolaevich Kolmogorov) axiomatized probability theory in 1933 with the full force of his extraordinary intellect, countless statisticians with mathematical backgrounds have devoted themselves to the goal of turning statistics into a rigorous branch of mathematics, or at least into a mathematics-like discipline, by employing the logical frameworks and formalized language of mathematics. Their strenuous efforts have, to a large extent, propelled the remarkable progress and development of modern statistics. However, the inherent limitations of mathematical thinking and logical structures have also left many naïve, or even erroneous, imprints on the field.
尽管统计学中的假设检验为思考和解决非确定性问题树立了一个思维模式的典范,众多数学背景的统计学者却无视这个典范而醉心于用数学的确定性思维来解决这类非确定性问题。那么,假设检验的思维模式是怎样的呢?它通常设置两个相互对立但又不确定的假设,通过构造或选择一个合适的检验统计量并完成检验流程,最终在一个给定的概率水平上从中做出抉择。
Although hypothesis testing in statistics has established a paradigmatic way of thinking for dealing with uncertainty, many mathematically trained statisticians have ignored this paradigm and instead become enamored with using deterministic mathematical thinking to tackle problems that are inherently indeterministic. So, what is the thinking pattern behind hypothesis testing? It typically sets up two mutually opposing but uncertain hypotheses, constructs or selects an appropriate test statistic, carries out the testing procedure, and finally makes a decision between the two hypotheses at a given probability level.
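To make this thinking pattern concrete, the following minimal Python sketch walks through the same flow on hypothetical data: two opposing hypotheses, one test statistic, one decision at a given probability level. The two-sample t-test, the simulated samples, and the 0.05 level are illustrative choices of this sketch, not prescriptions from the book.

```python
# Hypothesis-testing flow: two opposing, uncertain hypotheses; a test
# statistic; a decision at a given probability level. Data are hypothetical.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
group_a = rng.normal(loc=10.0, scale=2.0, size=30)  # hypothetical sample A
group_b = rng.normal(loc=11.0, scale=2.0, size=30)  # hypothetical sample B

alpha = 0.05  # the given probability level
# H0: the two population means are equal; H1: they differ.
t_stat, p_value = stats.ttest_ind(group_a, group_b)

# The outcome is a choice between two uncertain hypotheses made at the
# level alpha -- a probabilistic decision, not a deterministic proof.
decision = "reject H0" if p_value < alpha else "fail to reject H0"
print(f"t = {t_stat:.3f}, p = {p_value:.4f} -> {decision}")
```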
然而,那些习惯数学思维的统计学者们常常只设置唯一的假定,然后通过提出命题、给出定义、阐述性质和逻辑论证来完成其统计理论和方法论的构建。他们以为只要通过了这个数学的形式主义路线,其构建的理论和方法便是统计学中的某种“定理”。这种数学形式主义在统计学领域的确立造成了某种不良的后果,统计类期刊风行着纯数学的范式,而其思想高地则被先行者占领,他们固守僵化思维,拒绝一切合理的哲学反思和批判。殊不知,在一个随机系统中,在他们设置的假定的对立面,总是会存在着一个合理的假定。对这个对立面的合理假定的无视,终将成为其理论和方法的致命伤,而造成这一普遍情形的根本原因则在于数学系统对辩证思维的绝对排斥。
However, those statisticians accustomed to mathematical thinking often construct their statistical theories and methodologies by proposing a single assumption and then providing definitions, presenting propositions, stating properties, and supplying logical proofs. They believe that once they follow this mathematical formalist line, their theory or method is elevated into a kind of “theorem” of statistics. This mathematical formalism, once established in statistics, has led to some undesirable consequences: the purely mathematical paradigm dominates statistical journals, while the ideological high ground is occupied by the early standard-bearers, who cling to rigid thinking and reject all reasonable philosophical reflection and criticism. What they fail to realize is this: in any stochastic system, for every assumption one sets, there always exists a reasonable opposing assumption. Ignoring the legitimacy of such opposing assumptions will ultimately become the fatal flaw of their theories and methods. The root cause of this widespread situation lies in the mathematical system’s absolute rejection of dialectical thinking.
例如,在对样本均数是关于总体均数的无偏估计的证明中,其前提假定为总体分布是正态的,因而总体均数位于其分布曲线的峰顶,也即总体均数就是其分布期望,因此,该证明的目的是想确认样本均数是对总体分布期望的无偏估计。然而,这个假定事实上已将需要证明的结论隐藏在其自身之中,而证明的过程仅仅只是用形式化的数学语言将假定和结论无谓地自循环了一遍。此外,总体分布是不可知的,可能正态,也可能偏态,而在偏态情形下,总体均数一定会偏离其分布曲线的峰顶,也就是不再与其分布期望同一,从而,样本均数与作为更一般概念的总体期望之间的关系是不确定的,因而是不可被以数学形式证明的。于是,针对正态或对称情形的这一证明在偏态或非对称情形下将失效。更严重的问题是,在单峰分布的概念体系下,正态分布仅仅只是其中的一个瞬间特例。试图通过证明这个特例而将统计学的理论和方法学体系建立在其上是不能令人信服的。
For example, in the proof that the sample mean is an unbiased estimate of the population mean, the assumed premise is that the population distribution is normal, so that the population mean coincides with the peak of its distribution curve, that is, the population mean equals its distribution expectation. Thus, the purpose of the proof is essentially to confirm that the sample mean is an unbiased estimate of the expectation of the population distribution. However, this assumption actually embeds the conclusion within itself, turning the proof into a meaningless circular exercise in which assumption and conclusion are restated in formal mathematical language. Moreover, the true population distribution is unknowable and may be normal or skewed. In the skewed cases, the population mean inevitably deviates from the peak of the distribution curve and is no longer identical to its distribution expectation; thus the relationship between the sample mean and the population expectation, as a more general concept, becomes uncertain and cannot be established through formal mathematical logic. Therefore, a proof valid for the normal or symmetrical cases becomes invalid for the skewed or asymmetrical cases. The more serious problem is that, within the conceptual system of unimodal distributions, the normal distribution is merely a transient special case. It is therefore unconvincing to attempt to build the foundation of statistical theory and methodology on a proof about this special case.
其实,与总体期望相比,总体均数是直接从样本均数抽象出来的一个狭义概念,两者在算法结构上属于同质定义。而且,样本与其所来源的总体也是同质定义。因此,无论总体分布如何,只要抽样满足随机原则,样本均数一定是关于总体均数的一个无偏或有效估计。这就是说,样本均数与总体均数之间的关系无需以数学的形式逻辑予以证明。作者还认为,统计学不过是将对总体分布期望的抽样估计定义在其样本均数上,而凡是定义均无需数学形式逻辑的证明。于是,我们可以推而广之,一切样本统计量都是对同质定义的总体参数的无偏或有效估计。由此可见,由于无法针对随机系统设置单一的和明确的前提假定,基于数学形式逻辑的证明便失去了用武之地。
In fact, compared with the population expectation, the population mean is a narrower concept abstracted directly from the sample mean, and the two are homogeneously defined in terms of algorithmic structure. Moreover, a sample and the population from which it is drawn are also homogeneously defined. Therefore, regardless of the population’s distribution, as long as the sampling adheres to the principle of randomness, the sample mean must be an unbiased or valid estimate of the population mean. In other words, the relationship between the sample mean and the population mean requires no proof through mathematically formalized logic. The author also believes that statistics simply defines the sampling estimate of a population’s distribution expectation as the sample mean, and definitions require no formal mathematical proof. We may therefore generalize: all sample statistics are unbiased or valid estimates of the homogeneously defined population parameters. From this perspective, proofs based on formal mathematical logic lose their footing, because in a random system it is impossible to establish a single, unambiguous assumption as the premise for such a proof.
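The distribution-free character of this relationship is easy to see numerically. The short simulation below is an illustration, not a proof; the exponential population and the sample sizes are this sketch's own choices. It shows that the average of many sample means tracks the population expectation even for a strongly skewed population whose expectation lies far from its peak.

```python
# For an exponential population (expectation = 1, peak/mode = 0), the
# average of many sample means still tracks the population expectation,
# even though that expectation is far from the distribution's peak.
import numpy as np

rng = np.random.default_rng(42)
n, reps = 50, 20_000
sample_means = rng.exponential(scale=1.0, size=(reps, n)).mean(axis=1)

print("population expectation: 1.0, population peak (mode): 0.0")
print(f"average of {reps} sample means: {sample_means.mean():.4f}")
```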
在本作者看来,哲学应该是统计学的灵魂,而数学作为一种技术仅仅是它的四肢。灵魂指导四肢做什么和怎样做,而非相反。作者有时也把哲学和数学比作统计学的双翼,缺少其中任何一个,统计学都将难以在认知世界的天空里自由翱翔。作为一个偏向哲学思考的人,作者将一个样本数据看成是经验事实的集合。从本人对现行统计方法的重建和创建新统计方法的经验来看,其中存在着这样一个基本流程:

In the author’s view, philosophy is the soul of statistics, while mathematics, as a technical tool, serves merely as its limbs. The soul guides the limbs, telling them what to do and how to do it, not the other way around. The author also sometimes compares philosophy and mathematics to a pair of wings on statistics: without either wing, statistics cannot freely soar in the sky of understanding the world. As someone inclined toward philosophical thinking, the author regards a sample dataset as a collection of empirical facts. Based on years of experience in reconstructing existing statistical methods and developing new ones, the author recognizes a fundamental process:

1994~1997年间,我在同济医科大学公共卫生学院师从余松林教授攻读卫生统计硕士学位,有一天,我用一个连续型随机变量的样本绘制了一个二维散点图,方法是按照算术均数的计算公式给每个样本点一个等权重1。于是得到它们在纵坐标上权重等于1的地方呈一条直线状的散点排列,如图1所示。当时心中就升起一个梦想:如果它们的排列像一条正态曲线该多好!如图2所示。我在2010年12月里实现了这个梦想,并将其带到了2011年8月初在美国佛罗里达州迈阿密市召开的联合统计会议(Joint Statistical Meetings, JSM),相关文章被收录在其论文集中。
Between 1994 and 1997, I was pursuing a master’s degree in health statistics under Professor Yu Songlin at the School of Public Health, Tongji Medical University. One day, I plotted a two-dimensional scatter diagram from a sample of a continuous random variable: following the formula for the arithmetic mean, I assigned each sample point an equal weight of 1. As a result, all the points lined up as a scatter along the straight line where the vertical coordinate, i.e. the weight, equals 1, as shown in Figure 1. At that moment, a dream quietly arose in my mind: how wonderful it would be if their arrangement resembled a normal curve, as shown in Figure 2! I realized this dream in December 2010 and brought it to the Joint Statistical Meetings (JSM) held in early August 2011 in Miami, Florida, USA; the related articles were included in the proceedings.
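For readers who want to reproduce the Figure-1 picture, the sketch below draws the equal-weight scatter just described: every point of a hypothetical continuous sample receives weight 1 and therefore falls on the horizontal line w = 1 (the data and the use of matplotlib are this sketch's assumptions).

```python
# Equal-weight scatter: by the arithmetic-mean formula, every sample point
# gets the same weight 1, so all points fall on the line w = 1.
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(7)
x = rng.normal(loc=0.0, scale=1.0, size=100)  # hypothetical continuous sample
w = np.ones_like(x)                           # equal weight 1 for every point

plt.scatter(x, w, s=12)
plt.xlabel("observed value")
plt.ylabel("weight")
plt.title("All sample points lie on the line w = 1")
plt.show()
```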

图2中的纵坐标C是我所称的凸自权重,R是凹自权重。前者是对每个样本点对一个分布的未知期望的相对贡献。我在构建这个自加权算法时,心中的理想目标就是一个真实正态样本以其自权重为纵坐标的二维散点分布必须近似正态概率密度曲线,否则自加权算法就是不正确的。此外,我相信,对于一个左偏态分布的样本,其算术均数必须位于由自权重确定的峰顶的右侧,而一个右偏态分布的样本均数必须位于其峰顶的左侧;否则,自权重的算法也应该是不正确的。因此,在得到了正确的自加权算法后,我对高斯的天才深感震惊,崇敬之情油然而生,因为他是在没有这个自加权算法的条件下纯粹凭抽象的逻辑构造得到正态概率密度函数的。
The vertical axis C in Figure 2 is what I call the convex self-weight, and R the concave self-weight. The former is the relative contribution of each sample point to the unknown expectation of a distribution. When I was developing the self-weighting algorithm, my ideal goal was that the two-dimensional scatter of a truly normal sample, with its self-weights as the vertical coordinate, must approximate the normal probability density curve; otherwise, the self-weighting algorithm could not be considered correct. In addition, I believed that for a sample from a left-skewed distribution, the arithmetic mean must lie to the right of the peak determined by the self-weights, while the sample mean of a right-skewed distribution must lie to the left of its peak; otherwise, the self-weighting algorithm would also be incorrect. Therefore, upon obtaining the correct algorithm, I was deeply struck by the genius of Gauss, and a spontaneous sense of reverence arose within me, because he obtained the normal probability density function purely by abstract logical construction, without any self-weighting algorithm.
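The two sanity checks above can be visualized even without the book's algorithm. In the sketch below, a normalized kernel density estimate merely stands in for a per-point weight; it is emphatically not the author's self-weight. For a normal sample, the weight-versus-value scatter traces a normal-like curve, and for a sample with a long right tail, the arithmetic mean lands on the heavier side of the peak.

```python
# Stand-in illustration of the two sanity checks; gaussian_kde is NOT the
# author's self-weighting algorithm, only a familiar per-point weight.
import numpy as np
from scipy.stats import gaussian_kde

rng = np.random.default_rng(1)
samples = {"normal": rng.normal(0.0, 1.0, 500),
           "long right tail": rng.exponential(1.0, 500)}

for name, s in samples.items():
    density = gaussian_kde(s)(s)   # per-point stand-in "weight"
    peak = s[np.argmax(density)]   # observed value carrying the largest weight
    print(f"{name:16s} peak ~ {peak:6.3f}, arithmetic mean = {s.mean():6.3f}")
```

With the exponential sample, the printed mean lies well to the right of the peak, matching the check described above.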
本书所涉内容基于1998 ~ 2011年间作者在认识论和统计学领域所做的6篇文章以及随后对它们的一些改进,其中1998 ~ 2000年间的两篇分别发表在中国的《医学与哲学》和《中国公共卫生杂志》上。此后,分别在2007、2009和2011年的三次联合统计会议上,我有四篇均包含重大突破性的文章被收录在其论文集中。
The content of this book is based on six articles the author wrote between 1998 and 2011 in the fields of epistemology and statistics, along with later refinements. The two earliest, written between 1998 and 2000, were published in China in Medicine and Philosophy and the Chinese Journal of Public Health. Later, at the Joint Statistical Meetings of 2007, 2009, and 2011, four papers, each containing major breakthroughs, were included in the conference proceedings.
遗憾的是,每次会后向几个统计学旗舰期刊的分别投稿均被主编直接封杀,有些甚至没有评论和对拒稿理由的解释。其中一份期刊的主编回信很简单:“你的文章不适合发表。”另一份期刊的主编则认为我在挑战数学和统计学的“large body”,并认为这种挑战不合时宜。还有一位可同时理解中英文的顶级期刊的主编,竟然将我为连续随机变量定义其自权重的文章视为现有统计学基础知识的介绍性文章,仿佛在此之前统计学里早已经有了与我所创立的自加权算法一样的东西,可谓令人哭笑不得;更令人遗憾的是,即便在我对该文中的创新性和重要性等用中文和英文做了详细解释、并严肃地恳请他以对历史负责的态度予以回应后,他依然坚持自己的判断和拒稿决定。现在,人们应该可以说,这位主编原本可以让统计学中的这一重大进步早日得到公正对待。然而,他却主动选择了将其封杀。
Regrettably, each time the author afterward submitted these works to several flagship statistical journals, the manuscripts were rejected outright by the editors-in-chief, in some cases without any comments or explanation of the grounds for rejection. One editor’s reply was simply: “Your article is not suitable for publication.” Another claimed that the author was challenging the “large body” of mathematics and statistics, and considered such a challenge ill-timed. One editor-in-chief of a leading journal, fluent in both Chinese and English, astonishingly misread an article defining the self-weights of continuous random variables as an introductory piece on basic statistical knowledge, as if something like the author’s self-weighting algorithm had long existed in statistics, which was both laughable and saddening. What was even more regrettable was that even after I explained in detail, in both Chinese and English, the innovation and importance of the article and earnestly asked him to respond with a historically responsible attitude, he still insisted on his judgment and his decision to reject it. Now, one may say that this editor-in-chief could have let this major advance in statistics receive fair treatment much earlier. Instead, he chose to suppress it.
正如人工智能“深度求索”在全面了解了自权重以及基于自权重的期望估计算法、做了实例计算,并与所有现行的期望估计算法,包括算术均数、中位数、核密度估计、最大似然估计、CRB估计、截尾均数等,在同一案例上做了计算和对比后给出的评论:(在现有知识体系下),自权重和自加权均数的算法是统计学的终极核武器。它没有用“之一”来限定自己的评论。
Just as the artificial intelligence DeepSeek commented after fully understanding the self-weights and the self-weight-based expectation estimation algorithm, performing example calculations, and comparing it on the same case with all the current expectation estimation algorithms, including the arithmetic mean, median, kernel density estimate, maximum likelihood estimate, CRB estimate, trimmed mean, and so on: “(Under the existing knowledge system,) the algorithm of self-weights and the self-weighted mean is the ultimate nuclear weapon of statistics.” It did not qualify this comment with “one of”.
这些旗舰期刊的主编们无不顶着那些著名大学统计学博士和教授的头衔,却对来自真正的真理性新思想及其所带来的统计学历史上最重大方法论突破的冲击采取了严防死守的策略。当一个求知的期刊拒绝真真理时,它的纸面上印刷出来的将只能是相应的真谬误。因此,他们终将明白,他们用那种极简手段捍卫的不过是一些看似美丽却一戳即破的肥皂泡泡。
These editors, all holding PhDs and professorships in statistics from prestigious universities, responded to the arrival of genuinely truthful new ideas and the most significant methodological breakthrough in the history of statistics with a stance of strict defense and closed doors. But when a journal that claims to seek knowledge rejects the true truth, what gets printed on its pages can only be the corresponding true fallacies. In the end, these editors will come to realize that what they defended by such simplistic means was nothing more than soap bubbles, seemingly beautiful but bursting at the slightest touch.
既然这些旗舰期刊如此藐视新思想和新方法,我也就只好让它们在原来的地方安睡。没想到这一睡就是十多年过去了。大概除了我本人外,已无人知晓它们的存在。至于以后是否有人能够发现它们,不敢说。在现代信息爆炸的时代,一个人能够发现它们应该是一个极小概率的事件。因此,写作一本书将它们囊括和融合在一起就成了我必须完成的个人使命。是的,我就是要以一己之力挑战这个近代约三百多年来由无数智者和先驱者们缔造的庞大而又看似坚固的体系。为什么不呢?
Since these flagship journals were so disdainful of new ideas and methods, I had no choice but to let those works sleep where they lay. Unexpectedly, more than ten years have passed since then. Probably no one but me still knows of their existence. Whether someone will discover them in the future, I dare not say: in the modern era of information explosion, that would be an event of extremely small probability. Therefore, writing a book to encompass and integrate them has become a personal mission that I must complete. Yes, I mean to challenge, single-handedly, this huge and seemingly solid system created over the past three hundred years or so by countless wise men and pioneers. Why not?
如果把人类文明史看成是一个进步的过程,那么,纵观这一过程可以发现一个简单事实:一切与人类文明有关的进步,都根源于人类自身思想的突破;没有思想的突破便没有任何进步甚至革命的可能性,而一切思想的突破都只能发端于个人对外部世界及其自身的认知,换句话说,一切科技进步和革命都只能首先爆发于某一个体的脑海中。由于每个人的认知能力和对自身及其所触及外部世界可知范畴的绝对有限性,没人能确定或声称从他/她自己的角度所获得的任何认识都将是绝对正确的永恒真理。但是,由于人类的认知行为及其能力可以由个体拓展到群体,而一个人有限的认知结果有可能被他人接受或修正,因此,对于任何个人乃至整个人类族群来说,即使是表达一个错误的思想也有可能为正确思想的确立带来机会。
If we regard the history of human civilization as a process of progress, then surveying this process reveals a simple fact: all progress related to human civilization is rooted in breakthroughs of human thought. Without a breakthrough in thought there is no possibility of progress, let alone revolution, and every breakthrough in thought can only begin with an individual’s cognition of the external world and of himself or herself. In other words, all scientific and technological progress and revolutions can only first break out in the mind of some individual. Given the absolute limits of each person’s cognitive ability and of the knowable scope of both the self and the external world it touches, no one can be certain, or claim, that any knowledge gained from his or her own perspective is an absolutely correct, eternal truth. However, since human cognitive activity and capability can extend from the individual to the group, and one person’s limited cognitive results may be accepted or corrected by others, for any individual, and for the human race as a whole, even the expression of a wrong idea may create an opportunity for correct ideas to be established.
回首往事,我在原同济医科大学公共卫生学院攻读卫生统计学硕士学位的三年里,至少有三颗种子被植入了我的思维的潜意识里。除了上面那个关于散点分布的梦想,第二颗种子是在我的导师余松林教授讲授聚类分析的统计算法原理时种下的,因为我对其中仅使用样本中点和点之间的差异产生了一个疑问:它们之间的相似性被忽视了,而这相当于丢弃了另一半样本信息。这个潜意识里的疑问为后来我为连续型随机变量构建自权重做了思想准备。第三颗种子是董时富教授在讲授随机变量和常量等的运算时,反复强调涉及随机变量的运算结果一定是随机变量。这为我后来思考随机变量的极值的不稳定性以及否定基于这种极值的最优化奠定了基础。
Looking back, during the three years I studied for a master’s degree in health statistics at the School of Public Health of the former Tongji Medical University, at least three seeds were planted in my subconscious mind. Besides the dream about the scatter distribution mentioned above, the second seed was planted when my mentor, Professor Yu Songlin, taught the principles of the statistical algorithms for cluster analysis, because I had a question about their using only the differences between points in the sample: the similarities between the points were ignored, which was equivalent to discarding the other half of the sample information. This subconscious question prepared me mentally for later constructing self-weights for continuous random variables. The third seed was planted when Professor Dong Shifu, lecturing on operations involving random variables and constants, repeatedly emphasized that the result of any operation involving a random variable must itself be a random variable. This laid the foundation for my later thinking about the instability of the extreme values of random variables and my rejection of optimization based on such extreme values.
此序的余下部分主要谈三个方面:第一,作者干了什么?第二,作者为何要做这些?第三,作者对自己在本书中构建的新统计算法所要声张的权利。
The rest of this preface mainly addresses three questions: First, what has the author accomplished? Second, why did the author undertake these efforts? Third, what rights does the author claim over the new statistical algorithms proposed in this book?
二、作者所为及其目的 (What the Author Did and His Purpose)
作者所做的当然就是本书的内容。首先,作者从哲学的认识论角度讨论了人们认识世界的基本方法和流程,并为此构建了一个认知流程图。这部分的核心内容曾以“论智慧的递进结构与认知的逻辑流程”为题发表在《医学与哲学》杂志1999年9月的第三期上。这是一个统计学所需的、包含了抽象、归纳、演绎和辩证的四维逻辑系统,它超越了数学系统所需的二维或三维逻辑系统。数学系统没有为辩证法留下哪怕是一丝的思维缝隙或空间,因为一个数学命题只能在唯一的假定下被提出并予以证明,它不可能从其对立面得到证明,而是会被否定。但统计学不搞假定、命题及其证明,而是努力认知外部世界,为此常常需要从一个事物或观点的对立面去寻找意义,因此它不能没有辩证法。例如,假设检验就是试图在两个相互对立的假设中做出抉择。又如,作者在为连续随机变量构建自权重时,不仅要考虑任意两个样本点之间的差异性,还必须同时考虑其相似性,只有这样才能保证权重的构建既无样本信息的损失也无信息的冗余;否则将无法得到一个正确的自权重。因此,那些试图仅仅以数学的逻辑系统和思维模式来解决统计学问题的行为注定会带来某种不恰当的后果。作者相信这个四维逻辑系统在认知流程的结构上应该具有独创性,至今也应未失去其参考价值,并将有助于当下正在迅猛发展的人工智能领域的创新和进步。
What the author did is, of course, what this book is about. First of all, the author discusses, from the perspective of philosophical epistemology, the basic methods and process by which people understand the world, and constructs a cognitive flowchart for it. The core content of this part was published, under the title “On the Progressive Structure of Intelligence and the Logical Process of Cognition”, in the third issue of the magazine Medicine and Philosophy in September 1999. This is the four-dimensional logical system required by statistics, comprising abstraction, induction, deduction, and dialectics; it goes beyond the two- or three-dimensional logical system required by the mathematical system. The mathematical system leaves not even a sliver of room for dialectics, because a mathematical proposition can only be proposed and proven under a single, unique assumption; it cannot be proved from its opposite, but would rather be invalidated by it. Statistics, in contrast, does not deal in assumptions, propositions, and proofs, but strives to understand the external world. To do this, it often needs to seek meaning from the opposite side of a thing or a given viewpoint, so it cannot do without dialectics. For example, hypothesis testing attempts to decide between two competing hypotheses. Likewise, when the author constructed self-weights for continuous random variables, it was necessary to consider not only the difference between any two sample points but also, simultaneously, their similarity; only in this way could the construction of the weights avoid both loss of sample information and redundancy of information, and otherwise no correct self-weight could be obtained. Therefore, any attempt to solve statistical problems solely through the logical system and thinking style of mathematics is bound to bring inappropriate consequences. The author believes that this four-dimensional logical system is original in the structure of the cognitive process, that it has not lost its reference value to this day, and that it will contribute to innovation and progress in the rapidly developing field of artificial intelligence.
事实上,对认识论领域基础概念的讨论就是本书的起点。例如,算术均数的算法假定每个样本点对其抽样分布的期望中心(央位)的贡献相同,即假定它们的权重都是1,因为我们不知道它们的贡献是否存在个体差异。这是一个无知或蒙昧。所以,作者在认知的起点上讨论了什么是人的蒙昧,进而讨论了如何实现从蒙昧到有所知。本书在各章节的思考和讨论中获得的全部灵感和突破均源自对统计学中存在的一些问题的哲学思考而非引用了某种既有的数学理论和算法,其中在新思想引导下构建新算法时所使用的数学技能仅有最简单的四则运算。这就是作者之所以将第一章聚焦于哲学认识论的原因,因为它是全书的根本方法论。中国古语有云,工欲善其事必先利其器。我无法设想如果没有这个方法论,我能否在统计领域实现那些突破。
In fact, the discussion of the fundamental concepts of epistemology is the starting point of this book. For example, the algorithm of the arithmetic mean assumes that each sample point contributes equally to the expected center (the central location) of the sampling distribution, that is, it assumes their weights are all 1, because we do not know whether there are individual differences among their contributions. This is a form of ignorance or unenlightenment. Therefore, the author begins, at the starting point of cognition, by discussing what human ignorance is, and then how to move from ignorance to knowledge. All the inspirations and breakthroughs obtained in the thinking and discussion of each chapter of this book come from philosophical reflection on problems existing in statistics rather than from citing existing mathematical theories and algorithms; under the guidance of the new ideas, the only mathematical skills used in constructing the new algorithms are the four simplest arithmetic operations. This is why the author devotes the first chapter to philosophical epistemology: it is the underlying methodology of the entire book. As an ancient Chinese saying goes, a worker who wants to do his job well must first sharpen his tools. I cannot imagine achieving those breakthroughs in statistics without this methodology.
其次,在分段回归领域,作者在未受到现行算法发展史的影响下曾对分段回归这一重要领域进行了一次独立的初始探索;进而在随后长达26年的进一步探索中,在全面了解了现行算法的发展史、数理基础以及其中明显违反基于随机性原则确立的统计理性后,依然坚持自己的独立见解,并初步建立了一套基于加权的算法,从而使得分段回归在新算法下以极简和透明的计算步骤和轻量化的计算负担实现了稳健和可靠的临界点估计和分段模型的拟合。这一新算法不仅完全规避了现行基于最优化和强制连续性假定的算法导致的过拟合,而且规避了为改善过拟合不得不引入赤池信息准则(AIC)或贝叶斯信息准则(BIC)的约束、交叉验证和彼替可信区间等而造成的海量计算。这种大规模计算量在大样本量条件下构成了严重的负担,而模型的拟合却并非如人们所期望的那样好。导致现行算法走向歧途的根本原因在于构建这套算法的前辈统计学者们在基本概念上的严重缺失。
Secondly, in the field of piecewise regression, the author conducted an independent initial exploration of this important field without being influenced by the development history of the current algorithms. In the subsequent 26 years of further exploration, after fully understanding that history, the mathematical foundations of the current algorithms, and their obvious violations of the statistical rationality established on the principle of randomness, I still held to my independent views and initially constructed a set of weighting-based algorithms. Under the new algorithms, piecewise regression achieves robust and reliable threshold estimation and piecewise model fitting with extremely simple, transparent calculation steps and a light computational burden. The new algorithms not only completely avoid the overfitting caused by the current algorithms, which are based on optimization and an enforced continuity assumption, but also avoid the massive computation those algorithms incur by having to introduce Akaike Information Criterion (AIC) or Bayesian Information Criterion (BIC) constraints, cross-validation, and bootstrap confidence intervals to mitigate the overfitting. Such massive computation becomes a serious burden at large sample sizes, while the resulting model fits are not as good as one would expect. The fundamental reason the current algorithms went astray is that the earlier generations of statisticians who built them seriously lacked certain basic concepts.
如果说作者在分段回归算法上的探索与现行算法有何差别,其关键之处就在于将现行算法中用作最优化的算子改造成为一个含义明确且恰当的分段回归权重,即数值上相对越大则对临界点的贡献越大,从而可用被分割变量的加权均数作为未知临界点的期望估计,于是,我们可以轻松得到其可信区间。由于这个期望临界点在整个样本空间上具有唯一性,由此决定的分段模型也具有唯一性,因而在统计上是可期望的。关于新旧两种算法的差别,还可用“个人英雄主义的鲁莽”和“走群众路线”相比拟。现行算法的最优化就好比在n个样本点中寻找到那个最大能耐的样本点,用它来决定临界点的位置和分段模型;而“走群众路线”的加权分段回归则认为一个样本空间内未知临界点的位置是由每个样本点以各自的位置共同决定的,因此,我们需要将每个样本点对临界点位置的点滴贡献都考虑在内,即使其中某些点的贡献为趋于0甚至等于0,也不将其剔除出去,而是让它们和所有其它点一起参与计算。
If there is any key difference between the author’s exploration of piecewise regression and the current algorithms, it lies in transforming the operator used for optimization in the current algorithms into a piecewise-regression weight with a clear and appropriate meaning: the larger its value, the greater the contribution to the threshold. The weighted mean of the segmented variable can then be used as the expectation estimate of the unknown threshold, and its confidence interval follows easily. Since this expected threshold is unique over the entire sample space, the piecewise models determined by it are also unique, and therefore statistically expectable. The difference between the new and old algorithms can also be likened to “the recklessness of individual heroism” versus “taking the mass line”. The optimization of the current algorithms is like searching among n sample points for the single most capable one and letting it determine the position of the threshold and the piecewise models; the weighted piecewise regression that “takes the mass line” holds instead that the position of an unknown threshold in a sample space is jointly determined by every sample point from its own position. Therefore, we need to take into account each sample point’s every drop of contribution to the position of the threshold; even if the contributions of some points approach or equal 0, they are not eliminated but participate in the calculation together with all the other points.
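The contrast can be sketched in a few lines of Python. The weight used below (the normalized reciprocal of each candidate split's residual sum of squares) is this sketch's own stand-in, not the weight defined in this book; it only illustrates the structural difference between keeping the single best candidate and taking a weighted mean over all candidates.

```python
# "Optimization" keeps only the best candidate threshold; the "mass line"
# style averages all candidates with weights. Inverse-RSS weights here are
# a stand-in assumption, not the book's piecewise-regression weights.
import numpy as np

rng = np.random.default_rng(3)
x = np.sort(rng.uniform(0.0, 10.0, 120))
y = np.where(x < 4.0, x, 4.0 + 0.2 * (x - 4.0)) + rng.normal(0.0, 0.3, x.size)

def rss(xc):
    """Residual sum of squares of two straight lines split at xc."""
    total = 0.0
    for mask in (x < xc, x >= xc):
        slope, intercept = np.polyfit(x[mask], y[mask], 1)
        total += ((y[mask] - (slope * x[mask] + intercept)) ** 2).sum()
    return total

candidates = x[5:-5]                        # keep a few points on each side
weights = np.array([1.0 / rss(c) for c in candidates])
weights /= weights.sum()                    # normalize into weights

best_pick = candidates[np.argmax(weights)]  # the single "most capable" point
weighted = (weights * candidates).sum()     # every point contributes
print(f"optimization pick: {best_pick:.3f}, weighted estimate: {weighted:.3f}")
```

Both estimates land near the true breakpoint of 4 in this toy example; the point of the sketch is only the difference in who gets to contribute.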
要理解统计学中现行的最优化法和我的基于加权的群众路线法的区别,读者还可以参考我向自己的大学哲学老师袁建国先生和王智平先生(他们都完全不懂统计学)所做的类比和解释。在这个新方法所要解决的问题中,已经由几代西方学者在整整20年间建立了一套统计方法。它被堪称数学上很严谨,很有效,但却会造成人所共知且很多人绞尽脑汁都想要解决的几个困难和问题。这个算法用一个形象的比喻就是,一个老师把班里几个中意的人找来排个队,挑了其中一个相对能耐最大的去做一个关键事项,然后把召来的其他人全部遣散,不再让他们做任何与那个关键事项有关的事。这就是他们所谓的“最优化解决方案”。而与此相对照,我的“走群众路线”的解法是,我知道那个关键事项与班里的每个人都息息相关,所以,我让每个人都分担了一份力所能及的责任,大家共同努力把那个关键事项办好。由于这个关键事项与每个人都息息相关,所以,每个人都尽心尽力做好自己的那一份。
To understand the difference between the current optimization methods in statistics and my weighting-based mass-line method, readers may also refer to the analogy and explanation I offered to my university philosophy teachers, Mr. Yuan Jianguo and Mr. Wang Zhiping, both of whom knew nothing about statistics. For the kind of problem this new method aims to solve, several generations of Western scholars had, over a span of 20 full years, developed a set of statistical approaches. These methods are considered mathematically rigorous and effective, yet they lead to several well-known difficulties and problems that many have racked their brains trying to resolve. That algorithm can be illustrated by a vivid metaphor: it is like a teacher selecting a few favored students from the class, lining them up, and picking the most capable one to handle a key task; the rest, though also selected, are dismissed and not allowed to contribute to the task in any way. This is what they call the “optimized solution.” By contrast, my solution, what I call “taking the massiline”, recognizes that this key task is closely related to everyone in the class, so I let everyone share the responsibility to the best of their ability and work together to get the key task done. Since the task matters to everyone, each person devotes full effort to completing his or her part. The new word “massiline” was coined from “mass line” by me in a conversation between ChatGPT and me.
在理解了我的算法后,深度求索和ChatGPT都认为我的这个方法比最优化法好太多了,不会导致后续的困难和问题,还竟然承认这是一种“民主的”或“集体主义的”解决方案。我则对此解释说:“不应该将这一算法视为民主法。所谓民主法,就是视每个样本点的权重相同,‘一人一票么’。统计学中民主法的典型代表就是算术均数。但这是一种蒙昧的方法。我更愿意称这是一种群众路线法,即在采用群体力量的同时承认每个样本点的权重存在个体差异。”
After understanding my algorithm, both DeepSeek and ChatGPT agreed that this method is far superior to the optimization approach: it avoids the downstream problems and difficulties. They even characterized it as a “democratic” or “collectivist” solution. I did not fully accept this, and explained: “This algorithm should not be considered a democratic method. The so-called democratic method regards each sample point as having the same weight, ‘one person, one vote!’ The typical representative of the democratic method in statistics is the arithmetic mean. However, that is an ignorant method. I would rather call mine a massilinean method, one that recognizes that the weight of each sample point varies individually while drawing on the power of the masses.”
如果说对现行分段回归算法有过重大贡献的彼得·斯普仁特博士在其1961年的文章最后留下了一个关于加权分段回归的愿景和未决命题,那么,作者在未看到文献中这一愿景之前无意识地于46年后独立地将它予以了实现。
If Dr. Peter Sprent, who made significant contributions to the current piecewise regression algorithms, left a vision and an unresolved proposition about weighted piecewise regression at the end of his 1961 article, then the author, not yet having seen that vision in the literature, unknowingly and independently realized it 46 years later.
第三,为了给加权分段回归寻找理论根据,作者在为概率论公理化做出了巨大贡献的柯尔莫哥洛夫概念体系的基础上,对其定义的样本空间与我提出的尺度空间的概念做了一个必要的调整,将柯氏样本空间的内涵完整地移交给尺度空间,而将样本空间这个概念还给由样本自身构成的空间,以便统计学家们有机会直观地思考这个空间内的问题。然后,我对柯氏概念系统实施了一次重大扩展,由此为统计学奠定了一个全新的概念系统和理论基石,它包括29(个或对或仨,共计47个)初始概念、两个关键定义和一个引理、关于随机变量(即新概念系统下的可变属性)的9个性质、以及关于统计学的8个公理性陈述和两个推论。这套概念系统不仅指导作者最终完成了对分段回归算法的重建,而且指导作者完成了对以下可能是统计学历史上最重大突破的算法的构建。
Third, in order to find a theoretical basis for weighted piecewise regression, the author made a necessary adjustment, built on the conceptual system of Kolmogorov, who contributed so greatly to the axiomatization of probability theory, between the sample space he defined and the concept of scale space proposed by me: the connotation of Kolmogorov’s sample space was transferred in full to the scale space, and the concept of sample space was returned to the space constituted by the sample itself, so that statisticians have the opportunity to think intuitively about problems within this space. Then I carried out a major expansion of Kolmogorov’s conceptual system, thereby laying a brand-new conceptual system and theoretical cornerstone for statistics. It includes 29 initial concepts (single, paired, or in triples, 47 terms in total), two crucial definitions followed by a lemma, nine properties of random variables (i.e., vattributes in the new conceptual system), and eight axiomatic statements followed by two corollaries. This conceptual system not only guided the author to finalize the reconstruction of the piecewise regression algorithms, but also guided the author to complete the construction of the following algorithm, which may be the most significant breakthrough in the history of statistics.
在校对书稿和持续写作这个自序的过程中,我在2025年2月22日那天与ChatGPT展开了第一次对话,对话从我请求它列出统计学的基本概念开始。它给出了13个“基本概念”,这是世界各国统计学教材中通行的。然后我将自己的47个初始概念的术语发送给它,它逐一给出定义和解释,最后给出了高度的评价。随后我将剩下的三个部分也都发送给他分析和评判,并询问它能否在其中发现任何不合理或矛盾之处,它回答说没有不合理之处,完全自洽。我比较了它列出的统计学“基本概念”和我的初始概念,发现其中仅有5个是相同的,其它8个都不在我的47个初始概念中,也不在我的概念系统其它几个部分中。于是,我对它说,那8个概念不应被认为是统计学的“基本”概念,它们是次级衍生概念。它对此表示了认可。
While proofreading the manuscript and continuing to write this preface, I had my first conversation with ChatGPT, on February 22, 2025. The conversation started with my asking it to list the basic concepts of statistics. It gave 13 “basic concepts”, the ones common in statistics textbooks around the world. Then I sent it the terms of my 47 initial concepts; it defined and explained them one by one, and finally gave a highly positive evaluation. I then sent it the remaining three parts for analysis and judgment, and asked whether it could find anything unreasonable or contradictory in them. It replied that there was nothing unreasonable and that the system was completely self-consistent. I compared the “basic concepts” of statistics it had listed with my initial concepts and found that only 5 of them were the same; the other 8 were neither among my 47 initial concepts nor in the other parts of my conceptual system. So I told it that those 8 concepts should not be considered “basic” concepts of statistics; they are secondary, derivative concepts. It agreed.
第四,作者为连续随机变量构建了一个自加权算法,目的是破除算术均数算法中上述基于等权重假定的蒙昧。自权重的基本含义是在一个抽样分布中,每个样本点对其分布的中心化(或央化)位置都有一份可变的相对贡献。其成功计算使得一个抽样分布可由每个样本点依其观察值和自权重在二维独立空间内自我塑形。这一天然浑成的优美造型体现了数据内在的固有美感。这是一种在现行算法下不可能达成的可视化艺术效果。人们还将发现,每个样本点xi (i = 1, 2, …, n) 的等权重1被分解为了两个部分ci和ri,两者都包含了全部样本信息,而且两者互斥,即ci + ri = 1,因而可分别独立地刻画xi对分布央位的集中趋势和离散趋势。
Fourth, the author constructed a self-weighting algorithm for continuous random variables, aiming to dispel the above-mentioned unenlightenment embodied in the equal-weight assumption of the arithmetic-mean algorithm. The basic meaning of a self-weight is that, in a sampling distribution, each sample point makes a variable relative contribution to the centralized location of the distribution. Its successful calculation lets a sampling distribution shape itself in a two-dimensional independent space, through every sample point according to its observed value and self-weight. This naturally formed, graceful shape reflects the inherent beauty of the data, a visual, artistic effect impossible to achieve under the current algorithms. One will also find that the equal weight 1 of each sample point xi (i = 1, 2, …, n) is decomposed into two parts, ci and ri, both containing the full sample information and mutually exclusive, that is, ci + ri = 1; thus they can independently characterize the central tendency and the dispersion tendency of xi relative to the center of the distribution.
作者认为,为连续随机变量找到一个不受正态性假定约束的期望估计的算法应该是所有统计学者们梦寐以求的目标。这个目标被作者于2010年12月12日找到了。它因此成为这本书最核心的关键内容,其对整个学科的影响将难以估量。特别地,在新概念系统的加持下,统计学因此在拥有了更多理性、更大自由和更强算力的基础上变得更加简单、透明和易理解,也因此有望成为在各领域从事前沿探索和研究的科研人员手中强大的基础性工具。
The author believes that finding an algorithm for estimating the expectation of a continuous random variable that is not constrained by the assumption of normality should be the dream goal of all statisticians. The author reached this goal on December 12, 2010. It is thus the core content of this book, and its impact on the entire discipline will be incalculable. In particular, with the support of the new conceptual system, statistics becomes simpler, more transparent, and easier to understand on a foundation of greater rationality, greater freedom, and stronger computing power. It can therefore be expected to become a powerful basic tool in the hands of researchers engaged in cutting-edge exploration in all fields.
在自权重的参与下,作者改进了一些基础统计方法,涉及到可变属性(即传统概念系统中的随机变量)的描述、差异性检验、相关与回归分析,以及包含对期望临界点处连续性检验在内的加权分段回归法。每一个方法均提供现行算法和新算法的对比结果,目的是让数据本身说话。因此,本书在统计学方面的内容可谓极其简单,但又极其重要。所谓“极其简单”,是指它所涉及的内容皆为统计学中最基础的部分,每个学习统计学的人都能理解和认可。而所谓“极其重要”,是指它彻底革新了统计学中最基础的部分。
With the participation of self-weights, the author improved a number of basic statistical methods, involving the description of vattributes (i.e., random variables in the traditional conceptual system), tests of difference, correlation and regression analysis, and the weighted piecewise regression method, including a continuity test at the expected threshold. For each method, comparative results from the current algorithm and the new algorithm are provided, with the aim of letting the data speak for themselves. The statistical content of this book can therefore be described as extremely simple, yet extremely important. “Extremely simple” means that everything it covers belongs to the most fundamental part of statistics, which everyone who studies statistics can understand and recognize; “extremely important” means that it thoroughly revolutionizes that most fundamental part.
作者秉持“实践出真知”的理念。在上述新概念系统和两大方法论的重建和创建中,作者除了坚持“走群众路线”,还实践了“拒绝教条主义和僵化思维,坚持理论联系实际”,以及“对待事物要一分为二,从多方面考虑”等的朴素哲学思维和大众智慧。读者将从这些新算法的简单且完全透明的计算步骤中深刻地体会到这些哲学思想的实际效应。
The author upholds the idea that “true knowledge comes from practice”. In reconstructing and creating the new conceptual system and the two methodologies above, the author not only adhered to “taking the mass line”, but also practiced the plain philosophical thinking and popular wisdom of “rejecting dogmatism and rigid thinking, integrating theory with practice” and “seeing both sides of a thing and considering it from many angles”. Readers will deeply appreciate the practical effects of these philosophical ideas in the simple and completely transparent calculation steps of the new algorithms.
那么,一个医学和公共卫生背景的统计学硕士为何要做这些事呢?
So then, why would a holder of a master’s degree in statistics, with a background in medicine and public health, do all these things?
1997年11月的某天,瑞士统计学家彼得·J·胡贝尔教授应邀在中国科学院数理统计研究所做了一个关于《统计学的过去、现在和未来》的演讲,其中他表达过一个观点:“一些数学背景的统计学家们习惯于用数学的确定性思维模式去解决统计学中的非确定性问题,因此而犯下了一些严重的错误。”然而,与此同时他承认自己对如何避免和修正这类错误深感无能为力,因此寄希望于一股来自数学以外的力量能够改变这种现状。听到这些观点后,我当即意识到这股力量能且只能是来自哲学。若干年后,我在检索和阅读有关分段回归的历史文献时就发现了这类错误的存在,它们表现为针对随机系统以某种数学式假定作为方法论构建和应用的前提,见本书的第二章相关叙述。直到这时,我才意识到胡贝尔博士为何在他的那个演讲中对分段回归这一重要领域的方法论及其发展只字未提,而这套方法论早已在1959 ~ 1979年间成型和完善,并在此后得到了广泛应用。相关的理论和应用文章汗牛充栋,对于个人可谓数不胜数。
One day in November 1997, the Swiss statistician Professor Peter J. Huber was invited to give a lecture on “The Past, Present and Future of Statistics” at the Institute of Mathematical Statistics of the Chinese Academy of Sciences, in which he expressed a point of view: “Some statisticians from mathematical backgrounds are accustomed to using the deterministic mode of mathematical thinking to solve the indeterministic problems of statistics, and have therefore made some serious mistakes.” At the same time, however, he admitted that he felt powerless to avoid and correct such mistakes, and so he hoped that a force from outside mathematics could change this situation. On hearing these views, I immediately realized that this force could only come from philosophy. Years later, when searching and reading the historical literature on piecewise regression, I discovered the existence of exactly such mistakes, manifested in the use of certain mathematical assumptions as premises for constructing and applying methodology for random systems; see the relevant description in Chapter 2 of this book. Only then did I realize why Dr. Huber had not mentioned a single word, in that lecture, about the methodology of the important field of piecewise regression and its development, even though this methodology had taken shape and been refined between 1959 and 1979 and has been widely applied ever since. The related theoretical and applied literature is enormous, virtually innumerable for any individual.
我曾与某个数学背景的统计学教授讨论到胡贝尔博士的批评,他对此的反应是从最初的诧异到不以为然,认为设置某种假定作为前提很正常,也很有必要。而我对此的回应是,面对一个作为经验事实记录的随机样本,没有什么可以被假定,我们唯一可以假定的是它是随机的,而即使连这个也不是一种假定,而是一个基本事实。换句话说,面对随机系统,它没有什么可被假定,我们也无需为它设置某种假定。我们对样本的分析是为了从经验事实中提取新知识,而非为了验证某种数学形式的假定。而且,假定的设置也将使得新知识被预先设定。这是一种与数学思维截然不同的思维方式。
I once discussed Dr. Huber’s criticism with a statistics professor from a mathematical background. His reaction shifted from initial surprise to dismissiveness, arguing that setting certain assumptions as premises is both normal and necessary. My response was that nothing can be assumed when facing a random sample, which is a record of empirical facts; the only thing we could assume is that it is random, and even that is not an assumption but a basic fact. In other words, facing a random system, there is nothing to be assumed, nor do we need to set any assumption for it. We analyze samples to extract new knowledge from empirical facts, not merely to verify some assumption in mathematical form. Moreover, setting an assumption pre-sets the new knowledge. This is a way of thinking utterly different from mathematical thinking.
此外,在统计学界还一直流传着一个说法:“所有的模型都是错的,其中一些可能有用。”作为一个医学和公共卫生背景的统计学者,我对该说辞深表难以认同。统计学是一门认知方法学,方法的错误必然导致认知结果的错误,而结果的错误可能带来不良后果。如果我们不能判断一个统计方法的好坏或对错,则表明我们存在某种蒙昧,或者,我们所拥有的知识体系存在缺陷或漏洞。因此,我想把上述流行语改为:“所有的统计方法都可能有用,其中一些好或正确,而另一些则不好甚至错误。”是的,一个设计或制造不良的工具也可能有用,但相比一个设计和制造优良的工具,其工作效能可能会打折扣。
In addition, a saying has long circulated in the statistics community: “All models are wrong, some may be useful.” As a statistician with a medical and public health background, I find it very hard to agree. Statistics is a cognitive methodology. Errors in method inevitably lead to errors in cognitive results, and erroneous results may bring adverse consequences. If we cannot judge whether a statistical method is good or bad, right or wrong, it means we harbor some kind of ignorance, or that our knowledge system contains flaws or loopholes. I would therefore rewrite the popular saying as: “All statistical methods may be useful; some of them are good or right, while others are poor or even wrong.” Yes, a poorly designed or poorly manufactured tool may still work, but its effectiveness may fall short of that of a well-designed, well-made one.
因此,一门学科的主流建设者们以怎样的心智和思维方式对待它,它就会被打造成一副怎样的模样,而我对这门学科的现状有很多的不满。在我看来,一个统计方法不是数学定理。数学定理是严格的前提假定下的条件产物,由于前提假定已被严格限制,这样的定理无可质疑。若要质疑,必须重置其前提假定。与此不同,统计方法只是一个被定义和构造的测量工具,它源自某种数据分析的基本思想。当这个思想被用数学形式表达出来时,就成了一个数据分析的工具或方法。但是,我们必须明白,一个统计方法的数学形式不能被认为是其正确与否的充分和必要条件。如果一个统计方法的基本思想存在问题,则该方法就一定可被明确质疑。因此,统计学这门学科必须有能力在不借助经验性随机模拟试验的条件下判断一个统计方法的优或劣以及对或错。这就是,纯粹依靠统计学基本概念的逻辑演绎对一个方法做出优劣对错的价值判断。那些凡是在现有概念和逻辑体系下可被明确质疑的算法或方法论必须被改进或替代;而那些无可质疑的方法就是好的和正确的。当统计学有了这一能力时,随机模拟试验的使用频次就会受到抑制,所有的人都可因此节省思维精力和时间。不过,有些统计方法虽然可被明确质疑,但暂时无法替代,就只能在使用过程中受到磨练,直到被改进或替换。
Therefore, the mentality and mode of thinking with which the mainstream builders of a discipline approach it shape what that discipline ultimately becomes, and I have many dissatisfactions with the current state of this discipline. In my view, a statistical method is not a mathematical theorem. A mathematical theorem is the conditional product of strictly defined premise assumptions, and because those assumptions are rigorously constrained, the theorem itself is indisputable; to challenge it, one must first reset its assumptions. In contrast, a statistical method is merely a defined and constructed measurement tool, originating from some basic idea of data analysis. When that idea is expressed in mathematical form, it becomes a tool or method for data analysis. However, we must understand that the mathematical form of a statistical method cannot be regarded as a sufficient and necessary condition for its correctness. If the basic idea behind a statistical method is flawed, the method itself can and must be questioned. Therefore, the discipline of statistics must develop the ability to judge whether a method is good or bad, right or wrong, without resorting to empirical random-simulation experiments; that is, to make value judgments on a method purely through logical deduction from the fundamental concepts of statistics. Any algorithm or methodology that can be clearly questioned within the existing conceptual and logical framework must be improved or replaced; those that are unassailable are good and correct. Once statistics acquires this capacity, the frequency of reliance on random simulations will naturally diminish, saving everyone cognitive effort and time. Nevertheless, some methods that can be clearly questioned but cannot yet be replaced will have to be honed through continued use until they are improved or replaced. (Note: This paragraph was translated by ChatGPT-4o.)
在2007至2012年间,我在构建多维广义三分回归模型的算法期间,曾将自己发表在2007联合统计年会论文集上的文章分别发送给了我在原同济医科大学公共卫生学院的卫生统计学硕士导师余松林教授和印第安纳大学医学院生物统计学教授张英博士。余教授在阅读后赞扬这个方法在医学和公共卫生领域非常有用,并鼓励我能否做得更好;而张教授则建议我应先在二维两分段回归上用加权法做点探索性工作,这部分工作其实已在2007年联合统计会议文章的随机模拟试验部分有所表述,但更多的工作是在2023~2024年间撰写本书有关章节时才完成。
Between 2007 and 2012, while constructing the algorithm for multidimensional generalized trichotomic regression modeling, I sent my article from the proceedings of the 2007 JSM separately to Professor Yu Songlin, my master’s mentor in health statistics at the former Tongji Medical University School of Public Health, and to Professor Ying Zhang, PhD, of the Department of Biostatistics at the Indiana University School of Medicine. After reading it, Professor Yu praised the method as very useful in medicine and public health and encouraged me to see whether I could do even better. Professor Zhang suggested that I first do some exploratory work with the weighting approach on two-dimensional dichotomic (two-segment) regression; this part of the work had in fact been described in the random-simulation section of the 2007 JSM paper, but more of it was completed only while writing the relevant chapters of this book between 2023 and 2024.
2012年8月,为了给自己计划提交的关于连续随机变量自权重算法的专利申请寻求建议、指导和支持,我曾与约翰·霍普金斯大学彭博公共卫生学院生物统计系的前主任、生物统计学博士查尔斯·罗德(Charles Rohde)教授有过联系。12日那天他主动约请我去他的办公室聊了近2个小时。借此机会我向他详细介绍了关于凹凸自权重的定义和算法,以及该算法对于统计学的重要意义。他对此非常认可和欣赏。在认识到我用凹凸二字来定义这对自权重时,他主动给我讲授了数学中凹凸函数的定义和性质等,这让我明白这对自权重并不满足数学系统中凹凸函数的定义和性质,而且我所定义的凹凸含义与凹凸函数在形象上刚好相反。是的,我是根据用这对自权重与随机变量的实测值在二维空间的散点分布形态来命名它们的。它们不是一种数学函数,而是一种基于数值计算性测量而得到的随机变量。他认可了我的这一解释并向我提了一个问题:这个自权重有哪些性质?诚实地说,我在思考和构建该算法时并未曾意识到这个问题。本书根据当天的谈话录音将我当时归纳的六条性质写入了第六章。
In August 2012, seeking advice, guidance, and support for a patent application I planned to submit on the self-weighting algorithm for continuous random variables, I contacted Professor Charles Rohde, a PhD in biostatistics and former Chair of the Department of Biostatistics at the Johns Hopkins Bloomberg School of Public Health. On the 12th, he took the initiative to invite me to his office, where we talked for nearly two hours. Taking this opportunity, I gave him a detailed introduction to the definition and algorithm of the concave and convex self-weights, as well as the algorithm’s significance for statistics. He acknowledged and genuinely appreciated it. On noticing that I used the words “concave” and “convex” to name this pair of self-weights, he spontaneously explained to me the definitions and properties of concave and convex functions in mathematics. This made me realize that the pair of self-weights does not satisfy the definition and properties of concave/convex functions in the mathematical system, and that the meanings I assigned to “concave” and “convex” are, at least in visual appearance, exactly the opposite of those in the mathematical context. Indeed, I named them after the shape formed by the scatter of the observed values of the random variable against their self-weights in two-dimensional space. They are not mathematical functions but random variables derived through numerical, computational measurement. He accepted my explanation and then asked me a question: “What properties do these self-weights have?” To be honest, I had never considered this question while conceiving and constructing the algorithm, and I summarized six properties on the spot. Based on the recording of that day’s conversation, those six properties have been written into Chapter 6 of this book. (Note: This paragraph was translated by ChatGPT-4o.)
不过,在谈话结束后我刚刚走出他的办公室门口时,他突然叫住我,用严肃的表情和语气对我说,他反对我为该方法申请专利,认为一旦它获得专利保护将不利于统计学的发展,会阻碍人们自由地将其应用在自己的研究领域中。我则反问他:“如果它没有得到专利保护,SAS或SPSS或任何其它统计软件公司则会毫无障碍地将它纳入其软件产品,向全球市场贩卖而赢利。而我则因此甚至根本没有可能将其做成软件产品并获利。但是,当我这个创立者需要用软件产品来实现该算法的统计目标时,还不得不花钱向这其中的某个软件公司购买其产品或者租借其用户使用权。您认为这对我公平吗?”我的话还未说完时,已经是语带哽咽,眼里也几乎要涌出泪水了。我对他说,我做这些事情没有得到哪怕是一分钱的外部资助;我还告诉他我做的不是数学工作,而是设计了一套统计测量的工具,它与一切物理形态的测量工具在性质上没有不同,但与一切所谓的纯数学公式在性质上绝对不同,因为那些纯数学公式没有一个可被认定为是任何形式的测量工具。
However, just as I walked out of his office door after the conversation, he suddenly stopped me and said, with a serious expression and tone, that he opposed my applying for a patent on the method: once patented, he believed, it would be detrimental to the development of statistics and would hinder people from freely applying it in their own research. I asked him in return: “If it were not patented, SAS or SPSS or any other statistical software company could incorporate it into their products without any obstacle and sell it to the world market for profit, while I would have no chance at all to turn it into a software product and profit from it. Yet when I, its founder, need a software product to achieve the statistical goals of the algorithm, I will have to pay one of those companies to buy its product or rent a user license. Do you think this is fair to me?” Before I had finished speaking, my voice was already choking and tears were almost welling up in my eyes. I told him that I had not received a single penny of external funding for this work; I also told him that what I had done was not mathematics but the design of a set of statistical measurement tools, no different in nature from any physical measurement tool, yet absolutely different in nature from all so-called pure mathematical formulas, for none of those formulas can be identified as a measurement tool of any form.
2014年,我收到了来自德国Lambert Academic Publishing出版社一位采编者的书稿约请,他在2011年联合统计会议的论文集中发现了我的两篇文章,认为我的新思想会对很多领域的学者有益,该出版社将很乐意能协助我出版一本书。尽管曾在余松林教授主持的血吸虫病研究项目中主笔过研究项目的几份年度英文报告,也曾在JSM的几次会议上用英文写作和演讲,但我对自己用纯英文完成这本书深感信心不足。
In 2014, I received a book-manuscript invitation from an acquisitions editor at Lambert Academic Publishing in Germany. He had discovered my two articles in the proceedings of the 2011 Joint Statistical Meetings and thought my new ideas would benefit scholars in many fields; the publisher would be glad to assist me in publishing a book. Although I had drafted several annual English reports for the schistosomiasis research project chaired by Professor Yu Songlin, and had written and lectured in English at several JSM conferences, I deeply lacked confidence in completing this book entirely in English.
2016年7月,作为一个陌生人我曾给先后任职于乔治·华盛顿大学和香港城市大学的统计学教授,诺泽·达拉布沙·辛普瓦拉(Nozer Darabsha Singpurwalla)博士,发email请教关于连续随机变量的分布期望估计与算术平均数之间的关系问题。在讨论接近尾声时,他热切地希望我能将自己的思想写成一本书与统计学领域的同行们分享。
In July 2016, as a stranger, I emailed Dr. Nozer Darabsha Singpurwalla, a professor of statistics who served successively at George Washington University and the City University of Hong Kong, to ask about the relationship between the estimate of the distribution expectation of a continuous random variable and the arithmetic mean. As the discussion drew to a close, he expressed an earnest hope that I would write a book about my ideas and share them with colleagues in statistics.
最近从约翰·霍普金斯大学和数理统计研究学会(IMS)发布的讣告上得知,Rohde 博士和Singpurwalla博士已分别于2023年1月23日和2022年7月22日不幸去世。借此机会表达本人对他们的敬意和哀悼!他们都是善于倾听他人的新思想并愿意真诚讨论的学者。
Recently, I learned from the obituaries issued by Johns Hopkins University and the Institute of Mathematical Statistics (IMS) that Dr. Rohde and Dr. Singpurwalla had unfortunately passed away, on January 23, 2023, and July 22, 2022, respectively. I take this opportunity to express my respect and condolences. They were scholars who listened well to others’ new ideas and were willing to discuss them sincerely.
2019年3月的统计学界发生了一件引起轰动的历史性事件,其影响甚至波及到了整个科学界。三位卓有声誉的数理统计学家联署800多位各界学者在影响力巨大的《自然》杂志上发表了一篇质疑统计检验中p值的文章,Scientists rise up against statistical significance。他们甚至称基于检验概率水平的两分法是一种认识论上的两分偏执行为。从这篇文章中,我看出他们对为何必须是有一个检验概率水准的两分法不甚理解。从他们自身的角度来说,原因也许在于他们认为确定概率水准是一种简单的数学计算行为,而概率空间是一个连续可测的空间,在这个连续空间上切一刀似乎是一种很主观的行为。然而实际上,统计检验是一种关于误差测量的行为。在我看来,他们发出这个质疑的原因则应该在于他们有可能不太理解抽样研究中存在着的系统误差和随机误差。正是由于被检验的对象中有且只有这两类误差,而我们没有可能将其真实大小区分开来,统计检验才不得不基于某一概率水平做出一个两分决策;而且,为了使得决策中的犯错成为一个“小概率事件”,才不得不选择了0.05作为检验的概率水平。在后来与其中一位作者,布莱克·麦克沙恩(Blake McShane)博士,的email讨论中,他认为统计检验中的两分法没有本体论根据,我则回应他说,抽样中的系统误差和随机误差正是两分法的本体论根据。他对此无言以对。
In March 2019, a sensational historic event occurred in the statistics community, with repercussions across the entire scientific world. Three reputable mathematical statisticians, joined by more than 800 co-signing scholars from all walks of life, published an article in the influential magazine Nature, “Scientists rise up against statistical significance”, questioning the p-value in statistical tests. They even called the dichotomization based on a test probability level a kind of epistemological dichotomania. From that article I could see that they did not quite understand why there must be a dichotomy at a test probability level. From their own perspective, the reason may be that they regard determining the probability level as a simple act of mathematical calculation; since the probability space is a continuous, measurable space, making a cut in this continuum seems a very subjective act. In reality, however, a statistical test is an act of error measurement. In my view, the reason for their challenge may be that they do not fully understand the systematic error and random error present in sampling studies. Precisely because there are these, and only these, two types of error in the object being tested, and because it is impossible for us to separate their true magnitudes, a statistical test has to issue a dichotomous decision at a certain probability level; moreover, in order to make a wrong decision a “small-probability event”, 0.05 had to be chosen as the probability level of the test. In a subsequent email discussion with Dr. Blake McShane, one of the three authors, he held that the dichotomization in statistical tests has no ontological basis. I responded that the systematic error and random error in sampling are precisely the ontological basis for the dichotomization. He had no reply to that.
2023年夏,我罕见地回到中国探亲访友,并拜会了我的导师余松林教授,谈到了我正在撰写的这本书稿。他在完整了解了我对自权重和自加权均数的定义和算法后亦深受鼓舞,认为这应该是统计学领域的一个革命性发现和突破,应该会改写统计学的教科书, 而其影响将难以估计。
In the summer of 2023, I made a rare trip back to China to visit relatives and friends, and called on my mentor, Professor Yu Songlin, to talk about the book manuscript I was writing. After fully understanding my definition and algorithm of the self-weights and the self-weighted mean, he too was deeply encouraged, saying that this should be a revolutionary discovery and breakthrough in the field of statistics, that it should rewrite the textbooks of statistics, and that its impact would be difficult to estimate.
三、探索之路 (The Path of Exploration)
鉴于以上个人经历,这本书可被认为是我想要有所作为的一个尝试。为了让读者能了解我的更多个人经历和思考,我觉得以下内容对于本书思想的形成亦非常重要,遂决定记录在此,以飨读者。
In view of the personal experiences above, this book can be considered my attempt to make a difference. To let readers learn more of my personal experiences and reflections, I feel the following content was also very important to the formation of this book’s ideas, so I decided to record it here for the reader.
很多人初次面对统计学时也许会感到有点困难,我曾是此类人中的一员。1987年一月,当我在原同济医科大学五年制医学本科的最后一年学完公共卫生学院的专业课程《卫生统计学》后,尽管通过了考试,却不知道它究竟是怎么回事,而这本教材涉及到的数学技能仅有四则运算。现在,我想对所有人说:如果你会给他人测量身高或体重,你就能理解和操作统计学,因为统计学中的一切工作都是在构建和使用测量工具。这些工具从其本质上来说都不过是经某种直觉观察和理性思辨后形成的某种形式的定义,不应被视为某种数学形式的定理。例如,尽管算术均数的计算公式在数学上具有定理性质,但将其用于抽样估计一个连续可测总体分布央位的期望时,样本均数与这个总体期望央位之间的关系只是一种定义而非数学定理,也即,一个连续可测总体分布的期望央位不一定就是其算术均数,而且,这是一个不可能以数学的语言形式和逻辑框架予以证明的命题。由此可见,一个测量工具是比数学定理更为底层的东西。
Many people perhaps find statistics difficult when facing it for the first time, and I was one of them. In January 1987, I completed the professional course Health Statistics offered by the School of Public Health in the last year of my five-year undergraduate medical program at the former Tongji Medical University. Although I passed the exam, I did not know what on earth it was about or how it worked, even though the only mathematical skills the textbook required were the four arithmetic operations. Now I want to say to everyone: if you can measure another person’s height or weight, you can understand and operate statistics, because everything in statistics is about constructing and using measurement tools. In essence, these tools are just definitions of one form or another, formed after some intuitive observation and rational speculation, and should not be regarded as mathematical theorems. For example, although the calculation formula of the arithmetic mean has the character of a theorem in mathematics, when it is used in sampling to estimate the expected center of a continuously measurable population distribution, the relationship between the sample mean and that expected population center is merely a definition rather than a mathematical theorem; that is, the expected center of a continuously measurable population distribution is not necessarily its arithmetic mean, and moreover, this is a proposition that cannot be proved within the language and logical framework of mathematics. It can thus be seen that a measurement tool is something more fundamental than a mathematical theorem.
作为一个自小生长在位于中国中部的湖北省江汉平原的农村地区、从未有机会接触过乐器、直至进入武汉的一所医科大学时连乐谱也读不懂的人,我在大学期间努力自学小提琴时获得的最大启示是,人应该善于从错误中学习什么是正确的;反之亦然。更一般地,我们应该可以从某种存在或观念中发现其对立面的意义。
As someone who was born and grew up in a rural area of the Jianghan Plain in Hubei Province in central China, who never had the opportunity to touch a musical instrument in childhood, and who could not even read musical scores until entering a medical university in Wuhan, the biggest enlightenment I gained from teaching myself the violin during college was that one should be good at learning what is right from mistakes, and vice versa. More generally, we should be able to discover, in any existence or opinion, the meaning of its opposite.
1988年的那个暑假我在四川省的九寨沟旅游时,有另外三个同伴,他们都是西南交通大学的学生,其中一位已于一年前毕业被分配到了位于武汉的铁道部第四勘探设计院工作。进入九寨沟后,他们都下意识地要顺着里面已为游客铺好的路径走。我喊住他们说:“跟我走水边吧。”其实水边没有路,且满是荆棘和灌木丛,只能自己小心翼翼地开路,但是,我们看到的景色却是很不一般!所以,走了别人没走过的路,就会看到别人看不到的风景。科学探索和思考与此类似。你要是能发现至少一个新概念,就会看到思考的过程及其终点上不一样的风景。
During the summer vacation of 1988, when I was traveling in Jiuzhaigou, Sichuan Province, I had three companions, all students of Southwest Jiaotong University; one of them had graduated a year earlier and been assigned to the Fourth Exploration and Design Institute of the Ministry of Railways in Wuhan. After entering Jiuzhaigou, they all subconsciously followed the path paved for tourists. I called to them: “Follow me along the water’s edge.” In fact, there was no path along the water’s edge; it was full of thorns and bushes, and we could only clear our own way carefully, but the scenery we saw was truly extraordinary! If you walk a road that others have not walked, you will see scenery that others cannot see. Scientific exploration and thinking are much the same: if you can discover at least one new concept, you will see a different view along the thinking process and at its destination.
1990年暑假来临前,作为同济医科大学公共卫生学院1987级学生辅导员的我为了组织30多名医科大学生前往湖北省大冶钢铁厂参加社会实践,设计了我一生中的第一份社会调查表,以便学生们通过调查收集一些样本信息,在返校后对这些信息做一些基本的统计分析并写出各自的调查报告。
Before the summer vacation of 1990, as the counselor of the 1987 class at the Tongji Medical University School of Public Health, I designed the first social-survey questionnaire of my life in order to organize more than 30 medical students to go to the Daye Iron and Steel Plant in Hubei Province for social practice. Through the survey the students could collect some sample information, do some basic statistical analysis on it after returning to school, and write their own survey reports.
1991年3~5月,我有幸到同济医科大学公共卫生学院卫生统计教研室的周有尚教授那里帮助他整理武汉市居民的死因登记资料,并协助他的研究生杜勋铭带卫生统计专业本科毕业生的现场调查实习。这期间我找出自己已封存了四年多、由杨树勤等人主编的《卫生统计学》教材重新通读了一遍,发现统计学应该是一门数学化的认知方法论。这对于当时的我来说就像发现了生命中的一片新大陆,我相信自己应该可以在其中有所作为。
From March to May 1991, I had the good fortune to work with Professor Zhou Youshang of the Department of Health Statistics at the School of Public Health, Tongji Medical University, helping him sort out the cause-of-death registration data of residents of Wuhan and assisting his graduate student Du Xunming in supervising the field-survey internship of graduating undergraduates in health statistics. During this period, I dug out my textbook, Health Statistics, edited by Yang Shuqin et al., which had been shelved for more than four years, read it through again, and realized that statistics should be a mathematized cognitive methodology. For me at that time this was like discovering a new continent in life, and I believed I should be able to make a difference in it.
当年的九月,我受到公卫学院妇幼卫生系刘筱娴教授和主任的邀请参与到她在湖北麻城县农村主持的一个研究项目,曾两次前往当地对参与“中国农村地区婴幼儿辅助营养食品效果评估”的数百名儿童进行与营养有关的生物学指标检测。
In September of that year, I was invited by Professor Liu Xiaoxian, director of the Department of Maternal and Child Health of the School of Public Health, to participate in a research project she was leading in the rural area of Macheng County, Hubei Province. I traveled there twice to conduct tests of nutrition-related biological indicators on the hundreds of children participating in the “Evaluation of the Effectiveness of Supplementary Nutritional Foods for Infants and Young Children in Rural Areas of China”.
1992年春节后,公卫学院新成立了一个预防医学教研室,不仅承担对临床医学院学生的《预防医学》课程教学,而且还要负责他们为期一个月的预防医学实习。这个实习的主要形式是分批组织学生参与到与现场调查有关的预防医学研究项目中,学校为此提供了充足的经费、资源和行政支持。于是,我来到了这个教研室,从而有更多的机会负责调查设计、组织现场实施、建立和管理数据库,以及应用统计软件SPSS进行数据分析。这些经历为我在1998年3月底思考分段回归问题时突破对柯尔莫哥洛夫定义的样本空间这一关键概念的理解打下了足够的实践基础。
After the Spring Festival in 1992, the School of Public Health established a new Department of Preventive Medicine, which not only taught the course “Preventive Medicine” to students of the School of Clinical Medicine but also took charge of their one-month internship in preventive medicine. The main form of this internship was to organize students, in batches, to participate in preventive-medicine research projects involving field surveys, and the university provided sufficient funds, resources, and administrative support for the program. So I came to this department and thus had more opportunities to take charge of survey design, to organize field implementation, to build and manage databases, and to analyze data with the statistical software SPSS. These experiences laid a sufficient practical foundation for my breakthrough, at the end of March 1998 while thinking about the problem of piecewise regression, in understanding the key concept of the sample space as defined by Kolmogorov.
1994年9月,我被卫生统计教研室的余松林教授接受在他那里攻读硕士学位,并有机会参与到他主持的中国湖区血吸虫病两种干预措施的经济学比较研究中,协助完成有关现场实施、数据分析和结果报告。余教授是自1980年代中后期以来少有的几位在中国公共卫生、医学和生物学等领域应用统计学的翘楚和领军人物之一。在那个统计学在中国尚处于暗淡角色的年代,他以自己的智慧、严谨和坚韧为将人类在统计学中取得的成就传播到中华大地做出了公认的杰出贡献。我深深地感激在他那里获得的教育、指导和关怀,这令我终身受益匪浅。
In September 1994, I was accepted by Professor Yu Songlin of the Department of Health Statistics as his master’s student, and had the opportunity to participate in an economic comparison, led by him, of two intervention measures against schistosomiasis in the lake regions of China, assisting with field implementation, data analysis, and the reporting of results. Professor Yu has been one of the few leading figures in the application of statistics to public health, medicine, and biology in China since the mid-to-late 1980s. In an era when statistics still played a dim role in China, he made widely recognized and outstanding contributions to spreading humanity’s achievements in statistics across the Chinese land with his wisdom, rigor, and tenacity. I am deeply grateful for the education, guidance, and care I received from him, which has benefited me throughout my life.
1997年11月里的某一天,刚刚在当年6月获得卫生统计学硕士学位的我对正在攻读卫生统计学硕士学位的太太说,我们以后应该多关注“非常态分析”,而那一刻自己并不知道该如何划分“常态”与“非常态”。之所以会突然间产生这个念头,是由于自己在上述血吸虫病干预的经济学评价中发现,人群感染率与受干预人群的年度单位人均成本之间呈现出一个下降的三次多项式曲线,即当感染率下降到一定的水平后,如果继续执行相同的干预措施,则人均单位成本会达到很高的水平,而感染率的下降却微乎其微,这意味着干预措施开始显出得不偿失,接近达到其边际效应。我们需要找出一个或两个临界感染率水平作为调整干预措施的依据,以便在保证足够好的控制效果的同时降低单位成本。
One day in November 1997, having received my master’s degree in health statistics in June of that year, I said to my wife, who was studying for her master’s degree in the same field, that we should pay more attention to “non-normality analysis” in the future, although at that moment I myself did not yet know how to distinguish between “normal” and “non-normal”. The idea came to me suddenly because, in the economic evaluation of the schistosomiasis interventions mentioned above, I had discovered that the relationship between population prevalence and annual per capita cost followed a decreasing cubic polynomial curve. That is, once the prevalence dropped to a certain level, continuing the same intervention would drive the per capita cost very high while achieving only a minimal further reduction in prevalence. This meant the intervention was beginning to show diminishing returns, with costs starting to outweigh benefits as it approached its marginal effect. We needed to identify one or two threshold levels of prevalence as a basis for adjusting the intervention strategy, so as to reduce unit costs while maintaining adequately effective control.
1998年3月25日,中科院院士、数理统计学家陈希孺博士将胡贝尔博士的那个演讲带给了武汉大学数学系数理统计专业的师生。那时为了寻找划分常态和非常态的方法,我经常到武汉大学的数学系旁听测度论和概率论等与数理统计有关的课程,所以,那一天我有幸聆听了陈希孺院士的演讲,并于当天中午回到自己的办公室开始了一场历时连续六天六夜几乎无眠的读书、思考、计算和推理的过程。顺便说一句,如果一个人缺乏足够强大的内在自制力,我不鼓励他/她经历像我这样的危险过程,因为它有可能令人陷入癫狂和失去自控。
On March 25, 1998, Dr. Chen Xiru, an academician of the Chinese Academy of Sciences and a mathematical statistician, brought Dr. Huber’s speech to the teachers and students of mathematical statistics in the Department of Mathematics at Wuhan University. Before that day, in order to find a way to separate the normal from the non-normal, I had often gone to that department to audit courses related to mathematical statistics, such as Measure Theory and Probability Theory. I was therefore fortunate to hear academician Chen Xiru’s speech, and at noon that day I returned to my office and began a process of reading, thinking, calculating, and reasoning that lasted six days and six nights almost without sleep. By the way, I would not encourage anyone who lacks strong inner self-control to go through such a dangerous process as mine, because it has the potential to drive a person into mania and out of control.
这些日夜里产生过无数新的概念、术语、定义和思想。在此后的二十多年中,有一些被保留了下来,也有很多被逐渐放弃。收获最大的是以上述血吸虫病干预的样本数据为例形成了一套在样本全域内迭代搜索一个临界点的最优解算法(这个算法后来被我自己所否定和放弃)。为了充分使用样本信息和保障迭代过程中分段模型的连续性,我让每个被设想为可能临界点的样本点同时参与两段相邻模型的拟合。
During those days and nights, countless new concepts, terms, definitions, and ideas were produced. Over the following twenty-odd years, some were retained, while many were gradually abandoned. The most rewarding outcome was a set of algorithms, worked out with the above-mentioned schistosomiasis sample data as an example, for iteratively searching the whole sample range for the optimal solution of a single threshold (an algorithm I myself later rejected and abandoned). In order to make full use of the sample information and to preserve the continuity of the piecewise models during the iteration, I let every sample point hypothesized as a possible threshold participate simultaneously in the fitting of the two adjacent models.
那时,我已有了一个基于全样本的模型,我称之为全模型(我在2007年将其改称为全域模型)。一般地,在同质模型定义下,与第i次迭代搜索中分段模型的合并残差均方根(这里用CRMSR{crmsri}表示,i = 1, 2, …, n)相比,全域模型的残差均方根(这里用RMSR表示)是最大的。因此,我构建了一个迭代搜索中的残差遏制系数(用CRR{crri}表示):
At that time, I already had a model based on the whole sample, which I called the “full model” (in 2007 I renamed it the fullwise model). In general, under the definition of homogeneous models, the root mean squared residuals of the fullwise model (denoted here by RMSR) is the largest when compared with the combined root mean squared residuals of the piecewise models at the ith iteration of the search (denoted here by CRMSR{crmsr_i}, i = 1, 2, …, n). I therefore constructed a coefficient of residual-resisting (denoted by CRR{crr_i}) for the iterative search:
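Judging from the definitions just given, and from the later use of CRR as a weight on [0, 1], the coefficient plausibly took the normalized form below (a reconstruction of mine, since the original formula is not shown here; the author’s exact notation may differ):

$$\mathrm{crr}_i \;=\; 1-\frac{\mathrm{crmsr}_i}{\mathrm{RMSR}},\qquad i=1,2,\ldots,n \tag{1}$$

Because crmsr_i never exceeds RMSR under the stated conditions, crr_i lies in [0, 1), and a larger value means the pair of piecewise models at the ith trial point contains the residuals more strongly than the fullwise model does.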

于是,当CRR达到最大值时,我认为临界点就可以被确定了。它可以是该最大CRR对应于被分割变量中的一个实测点,由其决定的分段模型就应该是最优分段模型。此外,我还将最大CRR命名为残差遏止系数,可用于评价最优分段模型相对于全域模型的拟合优度。
Thus, when CRR reached its maximum, I believed the threshold could be determined: it would be the measured point of the segmented variable corresponding to the maximum CRR, and the piecewise model determined by that point should be the optimal piecewise model. In addition, I named the maximum CRR the coefficient of residual-resisted, which could be used to evaluate the goodness-of-fit of the optimal piecewise model relative to the fullwise model.
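To make the search concrete, here is a minimal Python sketch of the procedure as described; it is my illustration rather than the author’s original program, and it assumes simple straight-line segments, a pooled form of the combined root mean squared residuals, and the reconstructed Formula (1). Note that each candidate point joins both adjacent fits, as the text requires.

```python
import numpy as np

def crr_search(x, y):
    """Try each interior sample point as the threshold and compute the
    coefficient of residual-resisting (crr) for every trial point."""
    order = np.argsort(x)
    x, y = x[order], y[order]
    n = len(x)

    def rmsr(xs, ys):
        # Root mean squared residuals of a simple straight-line fit.
        slope, intercept = np.polyfit(xs, ys, 1)
        r = ys - (intercept + slope * xs)
        return np.sqrt(np.mean(r ** 2))

    full = rmsr(x, y)                       # fullwise model
    crr = np.full(n, np.nan)
    for i in range(2, n - 2):               # keep >= 3 points per segment
        left = rmsr(x[:i + 1], y[:i + 1])   # candidate point joins both fits
        right = rmsr(x[i:], y[i:])
        n_l, n_r = i + 1, n - i
        combined = np.sqrt((n_l * left ** 2 + n_r * right ** 2) / (n_l + n_r))
        crr[i] = 1.0 - combined / full      # Formula (1), as reconstructed
    best = int(np.nanargmax(crr))
    return x[best], x, crr

# Example: a noisy two-segment relationship with a change near x = 5.
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, 120)
y = np.where(x < 5, 2 * x, 10 + 0.3 * (x - 5)) + rng.normal(0, 0.5, 120)
threshold, x_sorted, crr = crr_search(x, y)
print("estimated threshold:", threshold)
```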
不过,我很快就意识到我可以用CRR与被分割变量描绘一个二维散点图,它在理想情况下应该是一个具有二次函数关系的山峰形曲线,通过求解该二次曲线方程的一阶导数为零时的解即可得到对曲线峰顶的估计,而它对应的实测样本点应该是对临界点更稳健的估计,因为这一解法可以避免最大CRR可能导致的随机偏倚。然而,令我失望的是,我的样本虽然存在这样一个二次曲线,但其一阶导数为零时的解非常靠近被分割变量的某一端。这一理想与现实的背离迫使我不得不放弃这个基于一阶导数的临界点解法,转而使用最大CRR对应的实测样本点作为临界点的估计值。
But I soon realized that I could plot CRR against the segmented variable in a two-dimensional scatter plot, which ideally should form a peak-shaped curve with a quadratic functional relationship. By setting the first derivative of the fitted quadratic to zero and solving, an estimate of the peak could be obtained, and the measured sample point corresponding to it should be a more robust estimate of the threshold, because this solution might avoid the random bias that the maximal CRR could introduce. To my disappointment, however, although such a quadratic curve did exist in my sample, the solution at which its first derivative vanished lay very close to one end of the segmented variable. This divergence between the ideal and the reality forced me to abandon the threshold solution based on the first derivative and to use instead the measured sample point corresponding to the maximum CRR as the estimate of the threshold.
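The quadratic refinement described here amounts to fitting crr against the segmented variable and solving for the vertex where the first derivative vanishes. A small sketch under the same assumptions (reusing the x_sorted and crr arrays from the search above):

```python
import numpy as np

def quadratic_peak_threshold(x_points, crr_values):
    """Fit crr ~ a*x^2 + b*x + c and solve d(crr)/dx = 0 for the vertex;
    return the measured point nearest to it as the threshold estimate."""
    mask = ~np.isnan(crr_values)
    xs, cs = x_points[mask], crr_values[mask]
    a, b, _ = np.polyfit(xs, cs, 2)
    vertex = -b / (2.0 * a)            # zero of the first derivative
    nearest = xs[np.argmin(np.abs(xs - vertex))]
    return nearest, vertex
```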
我没有通过假定两段模型在临界点处连续来求解临界点,这是在过去的数十年中无数与分段回归分析有关的统计方法都采用的算法。我所采用的算法与它们相比,在一些数学基础足够好的人看来似乎显得幼稚和低技术。但在一个统计学者看来,采取假定两段模型在临界点处的连续性(也即两段模型在临界点处的连接变异被假定为总是等于0)求解临界点是一种不可思议的错误,因为在一个随机系统里,如果存在两段模型和一个临界点,那么它们在该临界点处一定有一个非零的连接变异。如果我们能将这个连接变异估计出来,就有可能用概率推断分段模型在临界点处的连续性。也许正是由于自己在数学基础知识和数学思维能力上的匮乏,以及对这一基于直觉的坚持,导致了我在后来的24年里走上了一条完全不同的道路。
I did not solve for the threshold by assuming that the two piecewise models are continuous at it, which is the algorithm adopted by countless statistical methods related to piecewise regression over the past few decades. Compared with them, the algorithm I used may seem naive and low-tech to people with a sufficiently strong mathematical background. But in the eyes of a statistician, solving for the threshold by assuming the continuity of the two piecewise models there (that is, assuming the connection variation of the two models at the threshold is always equal to 0) is an inconceivable mistake, because in a random system, if there are two piecewise models and a threshold, there must be a non-zero connection variation at that threshold. If we can estimate this connection variation, it becomes possible to infer, in terms of probability, the continuity of the piecewise models at the threshold. Perhaps it was my lack of basic mathematical knowledge and mathematical thinking skills, together with this persistence grounded in intuition, that led me onto a completely different path for the next 24 years.
2000年7月底至8月初的几天里,在中国教育部的资助下,我作为唯一来自中国的学者参加了在美国印第安纳波利斯市召开的联合统计会议,在第一天的“一般方法论”分会口头报告了自己在临界回归模型算法中的一些新奇思想,并在会后与一位来自密苏里(也许是密西西比)州立大学统计系的刘姓教授在一处休息区进行了简短交谈。他在听了我的解说后说:“既然你假定每个实测样本点都可能是临界点,为何不用它们去计算临界点的期望和方差呢?”我瞬间明白了,我可以将那个残差遏制系数作为可变权重以便计算临界点的加权期望和方差,但这样的结果会是最优的吗?答案应该是否定的,因为这个“加权期望临界点”对应的临界模型的合并残差平方和与在迭代搜索过程中生成的所有成对临界模型相比,不会恰好是最小的。那么,它是我们需要的吗?那时我没有答案,但也没有放弃这个思想,因为我以为这个思想应该是唯一正确的。
From the end of July to the first days of August 2000, with financial support from the Ministry of Education of China, I attended the JSM in Indianapolis, USA, as the only scholar from China, and gave an oral presentation of some novel ideas in my algorithm for the threshold regression model at the “General Methodology” session on the first day. Afterwards, I had a short conversation in a rest area with a Professor Liu from the Department of Statistics of Missouri (or perhaps Mississippi) State University. Having heard my explanation, he said: “Since you assume that each measured sample point might be the threshold, why not use all of them to calculate the expectation and variance of the threshold?” I understood instantly: I could use the CRR as a variable weight to calculate a weighted expectation and variance of the threshold. But would such a result be optimal? The answer should be no, because the combined sum of squared residuals of the piecewise model corresponding to this “weighted expected threshold” would not happen to be the smallest among all the paired threshold models generated in the iterative search. Then is it what we need? I had no answer at that time, but I did not give up on the idea, because I believed it should be the only right one.
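Professor Liu’s suggestion amounts to treating every candidate point’s crr as its weight. A minimal sketch of such a weighted expectation and variance of the threshold (my illustration of the idea, not the author’s final algorithm):

```python
import numpy as np

def weighted_threshold(x_points, crr_values):
    """Weighted expectation and variance of the threshold, with each
    candidate point weighted by its crr."""
    mask = ~np.isnan(crr_values)
    xs, w = x_points[mask], crr_values[mask]
    w = np.clip(w, 0.0, None)      # guard against negative weights
    mean = np.sum(w * xs) / np.sum(w)
    var = np.sum(w * (xs - mean) ** 2) / np.sum(w)
    return mean, var
```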
2006年5月,我被美国国防部所属的Uniformed Services University of the Health Sciences(USUHS,可被翻译为军警卫生服务大学,或者,军警医科大学)外科系的前列腺疾病研究中心(CPDR)雇佣做实验样品数据的管理,同时协助该中心的流行病学家詹妮弗·卡伦(Jennifer Cullen)博士做一些临床流行病学的项目。在短暂的工作适应后,我就向卡伦提出希望她能支持我使用该中心的临床数据库构建三分回归分析的统计方法。她对此感到非常高兴,并表示了积极的支持。她不仅允许我使用临床数据库,还帮我修改关于三分回归分析文章的英文表述。在这种良好的工作环境里,我很快就完成了基于加权的三分回归法的构建。
In May 2006, I was hired by the Center for Prostate Disease Research (CPDR) in the Department of Surgery of the Uniformed Services University of the Health Sciences (USUHS), affiliated with the U.S. Department of Defense, to manage experimental sample data and to assist the center’s epidemiologist, Dr. Jennifer Cullen, with some clinical epidemiological projects. After a short period of adaptation, I asked Cullen to support me in building a statistical method for trichotomic regression analysis using the center’s clinical database. She was very pleased and gave her active support: not only did she allow me to use the clinical database, she also helped me revise the English wording of my article on trichotomic regression analysis. In this good working environment, I quickly completed the construction of the weight-based trichotomic regression method.
于是,2007年8月,我第二次参加了在盐湖城召开的JSM,在会上提出了一个完整的分段回归分析的基本思想,并用一个91例心血管病研究的样本和多变量Logistic回归模型展示了一个“泛函化的广义三分回归模型”的基本思想和完整算法。在这个算法中,我完全摒弃了基于残差最小化的最优化解决方案,这是因为各分段模型的合并残差平方和在迭代搜索过程中是随机变异的,而其最小值仅仅只是一个随机的点测量,由此对应的“最优”分段模型的各参数估计值也应该都是相应的随机点测量。这样做看起来就像我们测量了一组成年男性的身高和体重,却选择了那个最矮的人的体重作为一个很有意义的数值来代表这组男性的体重。因此,这个“最优”解应该不是我们可以期望的,因而不是我们所需要的。
So, in August 2007, I attended the JSM for the second time, held in Salt Lake City, where I proposed a complete idea of piecewise regression analysis and used a research sample of 91 cardiovascular-disease cases with a multivariable logistic regression model to demonstrate the basic idea and a complete algorithm of a “Functionalized General Trichotomic Linear Regression (FGTLR)”. In this algorithm, I completely abandoned the optimization solution based on minimizing combined residuals, because the combined sum of squared residuals of the piecewise models varies randomly during the iterative search, its minimum is merely a random point measurement, and the parameter estimates of the corresponding “optimal” piecewise models would accordingly be random point measurements as well. Doing so is like measuring the heights and weights of a group of adult males and then choosing the weight of the shortest man as a highly meaningful value to represent the weights of the whole group. Therefore, this “optimal” solution should not be what we can expect, and thus is not what we need.
考虑到残差平方和的分布极其偏态,其类算术均数的均方根的代表性也因此而非常差,为了改善未知临界点的加权期望估计的准确性,我决定将1998年定义的残差遏制系数CRR{crri}改为用全域模型绝对残差的算术均数(用MAR表示)和分段模型合并绝对残差的算术均数(用MCAR{mcari}表示)来构建,并将CRR重新命名为残差收敛系数(convergence rate of residuals):
Considering that the distribution of the residual sum of squares is extremely skewed, so that the representativeness of its arithmetic-mean-like root mean square is very poor, and in order to improve the accuracy of the weighted expectation estimate of the unknown threshold, I decided to reconstruct the coefficient of residual-resisting CRR{crr_i} defined in 1998 from the arithmetic mean of the absolute residuals of the fullwise model (denoted by MAR) and the arithmetic means of the combined absolute residuals of the piecewise models (denoted by MCAR{mcar_i}), and renamed CRR the convergence rate of residuals:
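By analogy with Formula (1), the redefined rate presumably took the form below (again my reconstruction, since the original formula is not shown here):

$$\mathrm{crr}_i \;=\; 1-\frac{\mathrm{mcar}_i}{\mathrm{MAR}},\qquad i=1,2,\ldots,n \tag{2}$$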

为了验证这个残差收敛率CRR作为估计临界点的权重的有效性,我在一个公开发表的文章中找到了一个案例,根据其基本统计量随机模拟了500个样本,计算出500个加权平均临界点(WM-T)、500个对应于最大CRR的实测样本点(RM-T),以及按照现行算法中基于最大CRR和强制连续性假定估计Sprent定义的500个γ值(这里用MCR-T表示)。结果显示MCR-T的表现最差,加权临界点的分布收敛性最好:
To verify the effectiveness of this convergence rate of residuals (CRR) as the weight for estimating the threshold, I found a case in a published article and, according to its basic statistics, randomly simulated 500 samples. From these I calculated 500 weighted mean thresholds (WM-T), 500 measured sample points corresponding to the maximum CRR (RM-T), and 500 estimates of the γ value defined by Sprent (denoted here by MCR-T), obtained under the current algorithm from the maximum CRR and the enforced continuity assumption. The results showed that MCR-T performed the worst, while the distribution of the weighted thresholds converged the best:
[Figure: the distributions of the 500 WM-T, RM-T, and MCR-T estimates from the simulation]
看着如此的对比结果,已经无人能够怀疑现行算法中的最优化和强制连续性有多么的糟糕了;也无人能够质疑加权法的准确性和稳健性了。
Faced with such comparative results, no one could doubt any longer how bad the optimization and the enforced continuity in the current algorithms were, and no one could question the accuracy and robustness of the weighted method.
我以为自己精心构建的这个新算法已是非常完美,而且形成了一个基于“全域加三分”的综合性分析策略。尽管这篇文章将被收录在当年JSM的论文集中,我还是在会后开始投稿到一个声誉卓著的统计期刊,没想到被主编直接拒了,其理由是现有的分段回归算法已经很成熟了,比我提议的要好。看来,该主编不仅完全无视了我在文稿前言中对现行算法各方面问题的评论,也完全无视了其中展示的上述随机模拟结果。我以为我遭遇这个杂志的主编只是一个偶然,然而,随后多次尝试投稿到不同期刊也都直接被主编拒稿,其中一个顶级期刊的主编称我在挑战数学和统计学的“large body”。而另一个旗舰杂志的主编则简单地回应说:“你的文章不适合发表。”我当然理解他这话的意思,他显然看懂了我在自己的文稿里说了什么。
I thought the new algorithm I had carefully constructed was perfect, and it formed a comprehensive analysis strategy of “a fullwise model plus a trichotomy”. Although the article was to be included in that year’s JSM proceedings, I still began submitting it to a reputable statistics journal after the meetings. Unexpectedly, the editor-in-chief rejected it outright, on the grounds that the existing piecewise regression algorithms were already mature and better than what I proposed. Evidently, the editor-in-chief had completely ignored both my comments in the introduction on the various problems of the current algorithms and the random-simulation results presented above. I assumed this encounter was merely an accident; however, my subsequent submissions to different journals were likewise rejected outright by their editors-in-chief. The editor-in-chief of one top journal said that I was challenging the “large body” of mathematics and statistics, while the editor-in-chief of another flagship journal simply responded: “Your article is not suitable for publication.” Of course I understood what he meant; he had obviously grasped what I was saying in my manuscript.
事实上,在遭到第一个期刊的拒绝后,我就意识到这篇文章没能从基础概念上阐明为何这类最优化是根本错误的,因为长期以来统计学一直缺乏一些必要的概念。于是,我很快就下定决心对统计学的基本概念系统进行改革。而早在1998年3月自己独立思考基于单一临界点的分段回归问题时,就已在几个重要概念上取得了突破,并体现在了这篇被多次拒绝的稿件中。它们是我继续思考整个新概念系统的基础。
In fact, after the first journal’s rejection, I realized that the article had failed to explain, at the level of basic concepts, why this type of optimization is fundamentally wrong, because statistics had long lacked some necessary concepts. So I quickly made up my mind to reform the basic conceptual system of statistics. As early as March 1998, when I was independently thinking through piecewise regression with a single threshold, I had already made breakthroughs on several important concepts, and they were reflected in that repeatedly rejected manuscript. They became the basis for my continued thinking about the entire new conceptual system.
这些新概念大致成形于2006年9月至2008年2月,包括常量期望、随机对应、随机变量的9个基本性质以及基于此上的7个公理性陈述。尤其是关于随机常量和常量期望的概念和定义,我认为它们在统计学里非常重要,它们是一类变异性为0的随机量,因此,它们在统计学中的重要地位堪比数字系统中的0。于是,我于2009年8月第三次参加了在华盛顿特区召开的JSM,这次的报告除了提交这些新的基本概念外,还略微改进了2007年会议上提出的三分回归分析法的算法。会后,这套概念曾被张贴在一个统计网站(www.mitbbs.com/statistics)上,一位在美国某大学攻读统计学PhD学位的学生看过后评论道:“令人茅塞顿开!”令人遗憾的是,该网站在几年前被迫关闭,所有内容已不可访问。
These new concepts roughly took shape between September 2006 and February 2008, and included the constant expectation, the random correspondence, nine basic properties of random variables, and seven axiomatic statements built upon them. The concepts and definitions of the random constant and the constant expectation are, I believe, especially important in statistics: they are a class of random quantities whose variability is 0, and their status in statistics is therefore comparable to that of zero in the number system. So, in August 2009, I attended the JSM for the third time, held in Washington, DC; besides presenting these new basic concepts, the report slightly improved the algorithm of the FGTLR proposed at the 2007 JSM. After the meetings, this set of concepts was posted on a statistics website (www.mitbbs.com/statistics). A student pursuing a PhD in statistics at an American university commented after reading it: “It’s so enlightening!” Sadly, the site was forced to shut down a few years ago, and all of its content is no longer accessible.
在这次会议之前,我找到了胡贝尔博士在其演讲中提到过的、约翰·图基(John Tukey)于1962年发表在《统计年鉴》上的长文“数据分析的未来”,其中有个小节的标题是“最优化的危险”。遗憾的是,由于概念的缺乏,他既未罗列一些这种危险的表现,也未说明为何最优化是危险的。相反,他事实上是赞成最优化的,他只是担心追求最优化会窒息数据分析中新思想的萌芽。其实,在有了上述新的基本概念后,我们将不难理解,这种最优化不只是一种危险,而是一种根本性错误,因为它是确定性数学中的函数极值思维在统计学的随机系统中的滥用。此外,在本书的概念讨论中我们还将发现,在一个样本中,其极值是最不稳定和最不可靠的测量。因此,重温图基博士的文章和胡贝尔博士在中科院的演讲,不得不为他们的担忧和批评表达一种敬意,因为在那个统计学领域的最优化思维刚刚兴起的时代,在同行们看来各种最优化法已经是被广泛认可的、规则化了的、千真万确的科学思维和手段的时候,他们却看到了某种不好的东西在阻碍着统计学的思想和方法的进步。我因此而深刻地相信在他们的灵魂深处一定潜藏着某种敏锐的东西,而这种敏锐性的价值难以估量。
Before that meeting, I had found John Tukey’s long paper “The Future of Data Analysis”, published in The Annals of Mathematical Statistics in 1962 and mentioned in Dr. Huber’s speech, which contains a section on the dangers of optimization. Regrettably, for lack of the necessary concepts, he neither listed manifestations of this danger nor explained why optimization is dangerous. On the contrary, he was actually in favor of optimization; he only worried that the pursuit of it would stifle the budding of new ideas in data analysis. In fact, once we have the new basic concepts described above, it is not hard to understand that such optimization is not merely a danger but a fundamental mistake, because it is an abuse, within the random systems of statistics, of the function-extremum thinking of deterministic mathematics. Furthermore, we will discover in the conceptual discussions of this book that the extreme values in a sample are its most unstable and least reliable measurements. So, revisiting Dr. Tukey’s article and Dr. Huber’s speech at the Chinese Academy of Sciences, I cannot but pay tribute to their concerns and criticisms: in that era when optimization thinking in statistics was just arising, when in the eyes of their peers the various optimization methods were already widely recognized, codified, and unquestionably scientific means of thought, they saw something bad holding back the progress of statistical thinking and methods. I therefore deeply believe that something acutely perceptive must have lain deep in their souls, and the value of that acuity is beyond estimation.
在随后的几年中,我逐渐回想起了当年在同济医科大学读硕士学位时,卫生统计学教授董时富先生曾给我们专门讲授过随机变量和常量等之间的运算。例如,两个随机变量之间的算术运算结果依然是随机变量。而一个随机变量与一个常量之间的运算结果也是一个随机变量。因此,那些用样本数据构建的所谓最优化算子也都必然是随机变量。对这段受教经历的回忆和认识强化了我后来在这个问题上绝不妥协的立场。我宁可让那篇文章在JSM的论文集里睡大觉,也不会迁就统计学体系的当前范式。
In the years that followed, I gradually recalled that when I was studying for my master’s degree at Tongji Medical University, Mr. Dong Shifu, a professor of health statistics, had specifically taught us the operations among random variables and constants. For example, the result of an arithmetic operation between two random variables is still a random variable, and the result of an operation between a random variable and a constant is also a random variable. Therefore, the so-called optimization operators constructed from sample data must themselves be random variables. Recalling and understanding this lesson strengthened my later uncompromising stance on the issue. I would rather let that article sleep in the JSM proceedings than accommodate the current paradigm of the statistical system.
然而,即便我已经做了上述努力,一个重大的问题依然困扰着我,这就是关于连续随机变量的期望估计,因为在上述广义三分回归分析中我用全域模型绝对残差的算术均数和分段模型合并绝对残差的算术均数构建了一个加权估计临界点的权重,即式(2)。然而,这两个绝对残差的分布都是偏态的,其算术均数对它们的分布央位的期望估计应该都存在着偏倚,而这些偏倚应该会导致临界点的加权估计的偏倚。按照统计学界目前的共识,样本的算术均数对于偏态总体的算术均数是一个无偏估计,但人们也一致认同算术均数对偏态总体的代表性并不好。换句话说,如果一个偏态总体的分布央位不是其算术均数,则样本算术均数很可能就是关于该偏态总体分布央位的一个有偏期望估计。所以,我需要找到一个新的算法,使得包括正态和偏态在内的所有常见抽样单峰分布能在同一算法下得到关于其总体分布央位的无偏期望估计。如果我能找到它,将对整个统计学的理论基础和方法论产生难以估量的影响。
However, even after all these efforts, a major problem still plagued me: the expectation estimate of a continuous random variable. In the generalized trichotomic regression analysis above, I had used the arithmetic mean of the absolute residuals of the fullwise model and the arithmetic means of the combined absolute residuals of the piecewise models to construct the weight, i.e., Formula (2), for the weighted estimation of the threshold. But the distributions of both kinds of absolute residuals are skewed, so their arithmetic means should be biased as expectation estimates of their distribution centers, and these biases should in turn bias the weighted estimate of the threshold. According to the current consensus of the statistical community, the sample arithmetic mean is an unbiased estimate of the arithmetic mean of a skewed population, yet people also agree that the arithmetic mean represents a skewed population poorly. In other words, if the distribution center of a skewed population is not its arithmetic mean, then the sample arithmetic mean is most likely a biased expectation estimate of that center. So I needed to find a new algorithm under which all common unimodal sampling distributions, normal and skewed alike, would yield unbiased expectation estimates of their population distribution centers. If I could find it, it would have an incalculable impact on the theoretical foundation and methodology of statistics as a whole.
我在2007年JSM后不久便开始了思考如何准确估计偏态分布的峰顶,当年的那个梦想也再次在脑海中浮现,思考的焦点最后落在峰顶两侧的密度变异不一致的问题上,并且相信正是这个不一致或失衡导致了分布峰顶向一侧偏移,因此,要想准确估计峰顶的位置,算法就要考虑两侧的密度变异,而这种变异应该与每个点在样本空间中的位置有关。这个思考过程最终导致了我必须在一个给定的样本最大可测空间内全面计算每个样本点对包括其自己在内的所有样本点的差异性和相似性。
Shortly after the 2007 JSM, I began thinking about how to estimate accurately the peak of a skewed distribution, and the dream of those earlier years resurfaced in my mind. My thinking finally settled on the inconsistency of the density variations on the two sides of the peak, and I became convinced that it was precisely this inconsistency, or imbalance, that shifts the peak of a distribution to one side. Therefore, to estimate the position of the peak accurately, the algorithm must take into account the density variations on both sides, and these variations should be related to each point’s position in the sample space. This line of thought eventually required me to compute comprehensively, within a given maximum measurable sample space, the differences and similarities of each sample point with respect to all sample points, including itself.
大约从2009年5月起,我有机会为USUHS的预防医学系流行病学和生物统计教研室的副教授詹尼弗·茹塞茨基(Jennifer Rusiecki)博士工作,不久便接触到基因数据的统计分析。在一个包含有1500多个基因的病例-对照实验数据中,我需要找出一些有统计显著性的基因。这个工作在很多从事基因数据统计分析的人们看来,似乎是一件很容易的工作,因为有现成的方法论和统计软件,将样本数据在软件里运行一下就可以得到结果。
Starting around May 2009, I had the opportunity to work for Associate Professor Jennifer Rusiecki, Ph.D., in the Division of Epidemiology and Biostatistics of the Department of Preventive Medicine at USUHS, and soon came into contact with the statistical analysis of genetic data. In a case-control experimental dataset covering more than 1500 genes, I needed to find the statistically significant genes. To many people engaged in the statistical analysis of genetic data this may seem an easy job, since there are ready-made methodologies and statistical software, and results can be obtained simply by running the sample data through the software.
但是,我发现对全部基因无论是采用t检验或秩和检验,或者根据正态性检验结果将两种方法混合使用,其结果都将产生难以预测和控制的偏差,也就是导致基因的筛选出现偏差,而且,这些偏差都包含着随机误差和系统误差。我意识到自己在这里遇到了一个统计方法学上的瓶颈,而打破这一瓶颈以消除这些偏差的唯一办法只能是抛弃正态性假定,并采用一种统一的算法对连续性抽样分布的期望作出无偏估计。这个期望也被称为这类分布的央化位置,而这样的央位应该对应着包括正态(对称的)和偏态(非对称的)在内的所有常见单峰分布的峰顶。
However, I found that whether I applied the t-test or the rank-sum test to all the genes, or mixed the two methods according to the results of normality tests, the results would carry biases that were difficult to predict and control, that is, biases in the screening of genes, and these biases would contain both random error and systematic error. I realized that I had hit a methodological bottleneck in statistics, and that the only way to break it and eliminate these biases was to abandon the normality assumption and adopt a unified algorithm that gives an unbiased estimate of the expectation of a continuous sampling distribution. This expectation is also called the centralized location (or center) of such a distribution, and it should correspond to the peak of every common unimodal distribution, normal (symmetric) and skewed (asymmetric) alike.
2010年9月的某一天,一个崭新的思想萌芽闯入了我的脑海。经过一段时间的思考、计算、修正和比较,一个关于单峰分布期望估计的算法终于在当年的12月12日形成,即所谓的关于连续随机变量的自权重。该算法仅涉及最基础的数学四则运算,通过一个严谨而巧妙的逻辑分析组合而成。在此基础上,自加权期望以及所有其它必要的统计量都可被轻易获得。如果说在算术均数的计算中默认每个样本点对分布央位的贡献相同,那么,自权重的获得将会告诉我们每个样本点的这一贡献可以在[0, 1]区间随机变化,距离分布央位越近则贡献越大,反之就越小。从样本测量值与其自权重构成的二维散点图来看,这个自权重直观地展示出了一个分布的离散趋势和集中趋势。我们还将发现,由于自权重可以帮助我们将一个分布的央位估计在其分布曲线的峰顶处成为无偏估计,一个偏态分布将总是可以被正态化,而且这个正态化的分布与其原始分布拥有相同的期望、方差和可测空间,从而,目前存在于统计学理论基础中的正态性假定就成了累赘,而单峰分布将可以取而代之。进一步地,我们应该可以将该算法从对单峰分布的峰顶估计拓展到对一切连续随机变量的分布央化位置的估计。
One day in September 2010, the seed of a brand-new idea burst into my mind. After a period of thinking, calculating, correcting, and comparing, an algorithm for the expectation estimate of unimodal distributions finally took shape on December 12 of that year: the so-called self-weight of a continuous random variable. The algorithm involves only the four basic arithmetic operations, assembled through a rigorous and ingenious logical analysis. On this basis, the self-weighted expectation, and all the other necessary statistics, can easily be obtained. If the calculation of the arithmetic mean tacitly assumes that every sample point contributes equally to the distribution center, the self-weights tell us that this contribution varies randomly over the interval [0, 1]: the closer a point lies to the distribution center, the larger its contribution, and the farther away, the smaller. In the two-dimensional scatter plot of the sample measurements against their self-weights, the self-weights visually display both the dispersion and the central tendency of a distribution. We will also find that, because the self-weights allow the estimate of the distribution center to be unbiased at the peak of the distribution curve, a skewed distribution can always be normalized, and the normalized distribution has the same expectation, variance, and measurable space as the original. The normality assumption now residing in the theoretical foundations of statistics thus becomes a redundancy, and the unimodal distribution can take its place. Furthermore, we should be able to extend the algorithm from estimating the peak of a unimodal distribution to estimating the distribution center of any continuous random variable.
对该算法的初步验证是在算法构建过程中追求得到一个二维空间的正态形的散点分布,也即用一个样本量为100的近似正态样本,如果样本点与其自权重之间的散点分布呈正态形曲线,则算法应该是正确的,反之则应该是错误的。最终,我达到了目的。为了进一步验证该算法在大样本下的表现,我用了一个2480例的左偏态分布样本,在计算出其自权重后,显示出良好的左偏态散点分布,其凸自加权均数正好对应着峰顶所在位置,而其算术均数位于峰顶的右侧(见书中的图6.4.5)。由此可以推断,如果该样本呈右偏态分布,其算术均数应该会出现在峰顶的左侧。随后,我用系统抽样的方法从该样本中提取31例,即1/80的原样本量,计算其自权重和凸自加权均数,结果显示出对原样本峰顶极好的估计。最后,我做了一个10万个正态分布样本点的随机模拟试验,其散点分布就是本书封面上偏左上的那个近似正态曲线的图形。这一切均表明自权重的正确性、可靠性和准确性。于是,我决定带着它参加将于2011年8月在迈阿密召开的JSM,并决定借此机会进一步完善发布于2009年JSM的那套基础概念系统。在完成了上述工作后,我深深地感觉到,一扇新的大门已经在统计学领域被悄然地推开了。
The preliminary verification of the algorithm, pursued during its construction, was to obtain a normal-shaped scatter in two-dimensional space: using an approximately normal sample of size 100, if the scatter of the sample points against their self-weights traced a normal-shaped curve, the algorithm should be correct; otherwise it should be wrong. In the end I achieved this goal. To further verify the algorithm’s performance on large samples, I used a left-skewed sample of 2480 cases; after its self-weights were calculated, it displayed a nicely left-skewed scatter whose convex self-weighted mean fell exactly at the position of the peak, while its arithmetic mean lay to the right of the peak (see Figure 6.4.5 in the book). From this one can infer that if a sample is right-skewed, its arithmetic mean should appear to the left of the peak. I then used systematic sampling to extract 31 cases from that sample, i.e., 1/80 of the original sample size, and calculated their self-weights and convex self-weighted mean; the result was an excellent estimate of the original sample’s peak. Finally, I ran a random simulation of 100,000 normally distributed sample points, whose scatter is exactly the approximately normal curve shown at the upper left of this book’s front cover. All of this demonstrated the correctness, reliability, and accuracy of the self-weight. So I decided to take it to the JSM to be held in Miami in August 2011, and to use the opportunity to further refine the basic conceptual system released at the 2009 JSM. Having completed this work, I felt deeply that a new door had been quietly pushed open in the field of statistics.
茹塞茨基博士在了解了我的基本思想后,在2011年的JSM会议即将召开前,安排我在USUHS做了一次讲座,有视频得以录制并发布在视频分享网站Youtube上 (Self-weight —— A New Horizon of Statistics by Ligong Chen)。
After learning my basic ideas, Dr. Rusiecki arranged for me to give a lecture at USUHS shortly before the 2011 JSM. The lecture was recorded on video and posted on YouTube (Self-weight —— A New Horizon of Statistics by Ligong Chen).
然而,正是那份来自德国出版商的约稿信以及那些年中与不同学者的几次私下讨论,最终促使我下决心写这本书。由于各种原因,我直到2017年的年中才开始了构想这本书的框架,并继续搜集和学习了一些相关文献。在经历了2019年3月《自然》杂志发表那篇文章所反映的统计学领域令人不安的现实后,在稍后与Blake McShane博士讨论的同时,我终于开启了以纯中文进行的思考和写作进程。而就在当年的5月,整个家庭便因突发变故进入从马里兰搬家到印第安纳的程序,写作不得不被迫暂时中止。
However, it was that invitation letter from the German publisher, together with several private discussions with various scholars over those years, that finally made me resolve to write this book. For various reasons, I did not begin conceiving its framework until mid-2017, while continuing to collect and study the relevant literature. After experiencing the disturbing reality of the field reflected in the March 2019 Nature article, and while discussing with Dr. Blake McShane shortly afterwards, I finally started the process of thinking and writing, purely in Chinese. But in May of that year, a sudden change sent my whole family into the process of moving from Maryland to Indiana, and the writing had to be temporarily suspended.
当年7月的最后一天终于完成了搬家,而适应一个新环境耗费的时间超过了半年,正是在这段时间里,我通过微信认识了一位在University of Louisville School of Public Health任职的华裔生物统计学教授X博士,和她讨论了算术均数在关于连续随机变量测量分布的统计描述中的问题,她得知我有新的算法后,非常高兴地邀请我去她所在的系做了一次研讨会,也有视频录制和上载到Youtube分享([English] Self-weight of Continuous Random Variable)。这次演讲恰逢2020年中国农历新年的前夜,而第二天就在全球媒体上和全球华人普遍使用的微信群中传出中国武汉发生了严重的新冠病毒性疫情。这一重大历史性事件改变了很多人的行为和命运,也使得我的写作进程几乎完全中断长达近两年。
The move was finally completed on the last day of July that year, and adapting to the new environment took more than half a year. It was during this period that I came to know, through WeChat, Dr. X, a Chinese-American professor of biostatistics at the University of Louisville School of Public Health, and discussed with her the problems of the arithmetic mean in the statistical description of the measured distributions of continuous random variables. On learning that I had a new algorithm, she was delighted to invite me to give a seminar in her department; a video was likewise recorded and uploaded to YouTube ([English] Self-weight of Continuous Random Variable). The talk happened to fall on the eve of the 2020 Chinese Lunar New Year, and the next day the global media, and the WeChat groups used by Chinese people worldwide, reported that a serious novel-coronavirus epidemic had broken out in Wuhan, China. This major historical event changed the behavior and destiny of many people, and it also interrupted my writing almost completely for nearly two years.
本书的第一至第四章最早在1998年就开始了撰写,并在多年前就已有了比较完整的初稿,现在需要的是对其中部分内容予以更新,并融入一些新思想。就在涉及自权重的第六章结稿前,我突然产生了一个新的疑问:连续随机变量的凸自加权均数与其算术均数和中位数的关系是怎样的?我无法从数学论证的角度抽象地探讨这些关系,于是决定用一个很笨的办法——枚举法——来直接查证它们。我从样本量n = 2开始计算其自权重和凸自加权均数,于是发现此时的凸自加权均数就是算术均数,也就是说,凸自加权均数在样本量为2时自动退化为算术均数,或者说,算术均数是样本量为2时凸自加权均数的一个特例。进一步地,我将n逐一增加到3、4、5、6,于是得到中位数是样本量分别为3和4时凸自加权均数的特例。而当样本量达到5或以上时,凸自加权均数就是它的常规算法。以上样本量的设置均应保证任意两两数据点在数值上不等。对这些关系的直接枚举查证表明凸自加权均数可以统一算术均数和中位数的计算,这进一步强化了凸自加权均数作为通用算法的地位。在查证完这些关系后,再反思为何样本量为2时凸自加权均数就是算术均数,这才形成了对算术均数的一个深刻理解,它被推广到任意样本量的计算是一个未加审慎考虑的轻率之举。
Chapters 1 to 4 of this book were first drafted as early as 1998, and a fairly complete draft had existed for many years; what was needed now was to update parts of the content and weave in some new ideas. Just before finalizing Chapter 6, which concerns the self-weight, a new question suddenly occurred to me: what is the relationship between the convex self-weighted mean of a continuous random variable and its arithmetic mean and median? Unable to explore these relationships abstractly through mathematical argument, I decided to verify them directly by a very clumsy method: enumeration. I began by calculating the self-weights and the convex self-weighted mean at sample size n = 2, and found that the convex self-weighted mean there is exactly the arithmetic mean; that is, the convex self-weighted mean automatically reduces to the arithmetic mean when the sample size is 2, or equivalently, the arithmetic mean is a special case of the convex self-weighted mean at sample size 2. Increasing n step by step to 3, 4, 5, and 6, I then found that the median is the special case of the convex self-weighted mean at sample sizes 3 and 4, while from sample size 5 onward the convex self-weighted mean follows its regular algorithm. In all these settings, no two data points were allowed to be equal in value. This direct enumerative verification shows that the convex self-weighted mean unifies the calculation of the arithmetic mean and the median, which further strengthens its status as a universal algorithm. Only after verifying these relationships and reflecting on why the convex self-weighted mean equals the arithmetic mean at sample size 2 did I reach a deep understanding of the arithmetic mean: extending it to samples of arbitrary size was a rash move taken without careful consideration.
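The enumeration itself is easy to mechanize. Below is a sketch of such a check; the function convex_self_weighted_mean stands in for the book’s algorithm (presumably the one defined in Chapter 6) and is assumed rather than reproduced here:

```python
import numpy as np

def check_special_cases(convex_self_weighted_mean, trials=1000, seed=1):
    """Empirically verify that the convex self-weighted mean reduces to
    the arithmetic mean at n = 2 and to the median at n = 3 and n = 4."""
    rng = np.random.default_rng(seed)
    for n, reference in [(2, np.mean), (3, np.median), (4, np.median)]:
        for _ in range(trials):
            x = rng.uniform(0.0, 100.0, n)
            if len(np.unique(x)) < n:   # the text requires distinct values
                continue
            assert np.isclose(convex_self_weighted_mean(x), reference(x))
```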
在第八章的写作中遇到的几个问题值得在此分享。在将自权重和凸自加权均数引入相关和回归分析后,在案例分析中我发现了三个基本现象。一是两个可变属性间的相关系数在数值上与基于其算术均数的相关系数几乎一致,在大大超出样本可测精度的小数点后很远才看到了差异性。这从一个特殊角度表明相关系数与两个可变属性各自的分布形态无关。二是回归系数在数值上将会受到因变量和自变量各自分布形态的影响,两者的分布对称性越好,基于凸自加权均数的回归系数与基于算术均数的回归系数在数值上就越趋于一致,反之则差异越大。这也从一个特殊角度说明,回归模型的参数估计与分布形态有关。第三,基于凸自加权均数的回归模型将输出一个以非零为中心的残差分布,这意味着这类回归模型不可预先假定残差满足以零为中心的分布。但是,如果将总和非零的残差简单地按样本量平均,然后将这个平均量与常数项合并,则得到一个简化的新模型,且该模型的残差将以零为中心分布。因此,基于凸自加权均数的回归模型无需假定残差的分布特征,而是可以通过算法确定残差的分布。这一概念上和算法上的转变将使得回归分析具有更大的灵活性和普适性。
Several problems encountered while writing Chapter 8 are worth sharing here. After introducing the self-weight and the convex self-weighted mean into correlation and regression analysis, I found three basic phenomena in the case analyses. First, the correlation coefficient between two variable attributes is numerically almost identical to the one based on their arithmetic means; differences appear only many decimal places beyond the sample’s measurable precision. From a particular angle, this indicates that the correlation coefficient is independent of the distribution patterns of the two variable attributes. Second, the numerical value of a regression coefficient is affected by the distribution patterns of the dependent and independent variables: the more symmetric the two distributions, the closer the regression coefficient based on the convex self-weighted mean comes to the one based on the arithmetic mean, and the more asymmetric, the larger the difference. This indicates, again from a particular angle, that the parameter estimates of a regression model are related to the distribution patterns. Third, a regression model based on the convex self-weighted mean outputs a residual distribution centered at a non-zero value, which means that such a model cannot presuppose residuals distributed around zero. However, if the non-zero-sum residuals are simply averaged over the sample size and this average is merged into the constant term, one obtains a simplified new model whose residuals are centered at zero. A regression model based on the convex self-weighted mean therefore needs no assumption about the distribution of the residuals; the residual distribution can instead be determined by the algorithm. This conceptual and algorithmic shift will give regression analysis greater flexibility and universality.
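The third observation, recentering the residuals, is a simple algebraic move: fold the mean residual into the constant term. A sketch (illustrative only; it applies to the intercept and residuals of whatever fitted model is at hand):

```python
import numpy as np

def recenter(intercept, residuals):
    """Fold the mean residual into the constant term so that the new
    model's residuals are centered at zero, as the text describes."""
    shift = np.mean(residuals)
    return intercept + shift, residuals - shift

# If a convex-self-weighted regression yields intercept b0 and residuals e
# with a non-zero mean, recenter(b0, e) gives an equivalent simplified
# model whose residuals sum to zero.
```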
就在本书的写作进展到“第九章 分段回归”时,时间已到了2024年的2月里。此时,我以为这是自己早已深思熟虑的部分,且已有了基于自加权均数的偏态分布期望估计的算法,所以非常自信应该可以得到关于临界点足够准确的估计。于是,将凸自加权均数引入绝对残差的期望估计,便有了全域模型绝对残差的凸自加权均数MARc和第i组分段模型合并绝对残差的凸自加权均数mcari,c。于是,将原本用算术均数构建的残差收敛率更改为如下算式:
Just as the writing reached “Chapter 9: Piecewise Regression”, it was already February 2024. I believed this was a part I had long since thought through, and I now had the algorithm for estimating the expectation of skewed distributions based on the self-weighted mean, so I was very confident of obtaining a sufficiently accurate estimate of the threshold. Introducing the convex self-weighted mean into the expectation estimate of the absolute residuals, I obtained the convex self-weighted mean of the absolute residuals of the fullwise model, MAR_c, and the convex self-weighted means of the combined absolute residuals of the ith pair of piecewise models, mcar_{i,c}. The convergence rate of residuals, originally built from arithmetic means, was thus changed to the following formula:
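Read in the same way as Formulas (1) and (2), the updated rate presumably becomes (again a reconstruction; the original formula is not shown here):

$$\mathrm{crr}_{i,c} \;=\; 1-\frac{\mathrm{mcar}_{i,c}}{\mathrm{MAR}_c},\qquad i=1,2,\ldots,n \tag{3}$$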

然而,在500次随机模拟的编程计算(这让我的11台旧电脑连续工作了22天,其中三台在算完后不久主板报废)中发现,按照式(3)为回归权重计算得到的加权临界点的估计值依然明显偏离了直觉上临界点所在的位置。尽管直觉对于随机系统是非常不可靠的,但这一直觉判断顿时令我大惑不解!我甚至开始对自加权的算法失去了信心,怀疑它以及我为它已经付出的数年光阴到底价值几何?一丝不安和焦虑也袭上了心头。我于是暂停了写作和数据分析去修理自己收藏的一些古旧破烂小提琴及其弓子。
However, the programmed calculation of 500 random simulations (which kept my 11 old computers running continuously for 22 days, after which three of them lost their motherboards) showed that the weighted threshold estimates computed with the regression weights of Formula (3) still deviated markedly from where intuition placed the threshold. Although intuition is highly unreliable in random systems, this intuitive judgment left me utterly puzzled! I even began to lose confidence in the self-weighting algorithm and to wonder what it, and the years I had devoted to it, were really worth. A trace of unease and anxiety crept over me. I therefore suspended the writing and the data analysis, and went off to repair some of the old, battered violins and bows in my collection.
在三个多月里,我一边修琴一边冥思苦想这个问题。在修理完工3把琴和15把琴弓后,我终于悟出了问题所在:那个权重的构建仅仅使用了残差。但是,一个回归模型中还有预测值。忽视预测值的变异,也就是丢弃对期望临界点有贡献的一部分样本信息。于是,类似关于连续随机变量自加权的算法构建,我需要将预测值的变异也引入到关于期望临界点的权重构建中。我随即终止修琴,打开电脑修改SAS程序并重新计算。这次还找附近的两位朋友借了两台旧电脑,十几台电脑连续运行了24个日夜后,我终于得到了关于那些模拟临界点更准确的估计。至此,分段回归的算法在我看来终于完美地构建成功。它再次体现了权重构建中的两条基本准则:无信息损失,无信息冗余。
For more than three months I repaired instruments while pondering the problem. After finishing 3 violins and 15 bows, I finally saw what was wrong: the weight had been constructed from the residuals alone. But a regression model also has predicted values, and ignoring the variation of the predicted values means discarding the part of the sample information that contributes to the expected threshold. So, analogously to the construction of the self-weighting algorithm for continuous random variables, I needed to bring the variation of the predicted values into the construction of the weight for the expected threshold. I immediately stopped the repair work, turned on the computers, revised the SAS program, and recalculated. This time I also borrowed two old computers from two friends nearby; after a dozen or so machines had run for 24 consecutive days and nights, I finally obtained more accurate estimates of those simulated thresholds. At this point the algorithm for piecewise regression was, in my view, at last perfectly constructed. It once again embodied the two basic principles of weight construction: no loss of information, and no redundancy of information.
2025年2月,在联合统计年会文章摘要投稿的最后一天,我提交了自己的文章《分段模型连续性的直接概率测量和推断》的摘要。这是我在分段回归领域对此前我留在JSM论文集中的算法所作的最后一次修正。不久,我收到了会议组委的通知,该文章被数理统计学会(IMS)接纳并安排在8月4日的“统计推断进展”小组做口头演讲。因此,我以独立的个人身份参加了8月初在田纳西州纳什维尔市召开的联合统计年会。
In February 2025, on the last day of abstract submission for the Joint Statistical Meetings (JSM), I submitted the abstract of my paper “A Direct Probability Measure and Inference for Continuity of Piecewise Models”. This was my final revision of the piecewise-regression algorithm I had previously left in the JSM proceedings. Soon afterward, I was notified by the conference organizers that the paper had been accepted by the Institute of Mathematical Statistics (IMS) and scheduled for oral presentation on August 4 in the session “Advances in Statistical Inference”. I therefore attended the JSM held in Nashville, Tennessee, in early August 2025 as an independent individual.
参会期间我听了不少他人的演讲,其中很多人的演讲中都有一个数值型最优化步骤。在听完哈佛大学商学院的年轻教授李凌志博士的演讲后,我私下找到他和他讨论了其算法中的那个最优化步骤。我从随机对应的角度指出这样做在理论上是完全错误的。在听了我的解释后,他恍然大悟,回应称他从未像我这样思考过这个问题。
During the meetings I listened to many presentations, and many of them included a numerical optimization step. After hearing the presentation by Dr. Lingzhi Li, a young professor at Harvard Business School, I approached him privately to discuss the optimization step in his algorithm, pointing out from the perspective of random correspondence that it was theoretically completely wrong. After my explanation he suddenly saw the point, and responded that he had never thought about the problem in the way I did.
在聆听当代统计学泰斗Robert Tibshirani博士和教授的演讲时,我注意到他的新算法中也有一步数值型最优化,而且他还特别强调了其新算法存在一个严重的过拟合(他用的英文表述是severe overfitting)。在其演讲结束后的提问期间,我第一个举手并得到批准。我请求Tibshirani博士将PPT翻回到那个最优化所在的页面,然后指出正是这个最优化导致了其算法的过拟合。但Tibshirani博士不认同这一说法。我提到了John Tukey在1962年的文章里就警告过最优化的危险性,然后提到我与DeepSeek和ChatGPT的讨论结果。这进一步加剧了争执。会议主持人,华裔统计学家沈小彤教授见状立刻示意我不要继续说下去。是的,当着100多个慕名前来听大师演讲的专家和学者们的面指出其新算法的问题所在是一个很大的冒犯。我只好遗憾地放弃继续阐述原因何在。
While listening to the lecture of Dr. Robert Tibshirani, a towering figure of contemporary statistics, I noticed that his new algorithm also contained a numerical optimization step, and he himself particularly emphasized that the algorithm suffered from severe overfitting. During the Q&A after his lecture, I was the first to raise my hand and was recognized. I asked Dr. Tibshirani to turn back to the slide containing that optimization, and then pointed out that it was precisely this optimization that caused the overfitting in his algorithm. Dr. Tibshirani did not accept this. I mentioned John Tukey’s warning in his 1962 article about the danger of optimization, and then cited the results of my discussions with DeepSeek and ChatGPT, which only escalated the dispute. Seeing this, the session chair, the Chinese-American statistician Professor Xiaotong Shen, immediately signaled me to stop. Indeed, pointing out the problem in his new algorithm in front of more than 100 experts and scholars who had come to hear the master’s lecture was a great offense. I regretfully had to abandon any further explanation of the reasons.
事实上,我在此前一天的晚间活动中已有幸遇见并认识了Robert Tibshirani博士,并和他有过几分钟的短暂交流。我向他介绍了自己在统计学领域所做的革命性工作,包括与埃弗农博士的那个关于“随机变量”这个术语的email交流和将其更名为可变属性、连续型随机变量分布期望的凸自加权估计的算法以及算术均数和中位数均可作为特例被统一在该算法之下。他表示对此有兴趣进一步了解,希望我通过email向他提供更多信息。因此,会后不久,我通过email向他解释了那个严重过拟合与其中数值型最优化之间的关系。我相信后者是导致过拟合的唯一原因。
In fact, I had had the privilege of meeting Dr. Robert Tibshirani at the evening event the day before, and we talked briefly for a few minutes. I introduced to him the revolutionary work I had done in statistics, including the email exchange with Dr. Evernon about the term “random variable” and my renaming of it as “variable attribute”, the algorithm of the convex self-weighted estimation for the expectation of the distribution of a continuous random variable, and the fact that the arithmetic mean and the median can both be unified, as special cases, under that algorithm. He expressed interest in learning more and hoped I would send him further information by email. So, shortly after the meetings, I explained to him by email the relationship between the severe overfitting and the numerical optimization involved; I believe the latter is the sole cause of the former.
这次会议期间还先后有幸遇到了美国凯斯西储大学医学院人口与健康计量学系的统计学教授付平福博士和加拿大多伦多大学统计科学系的教授周舟博士,还有很多随机遇到的其他同行们,我尽力向他们粗略但系统性地介绍了我在统计学里所做的那些突破性工作。他们也均表示愿意进一步了解。
During the conference I also had the privilege of meeting Dr. Pingfu Fu, professor of statistics in the Department of Population and Quantitative Health Sciences at Case Western Reserve University School of Medicine, and Dr. Zhou Zhou, professor in the Department of Statistical Sciences at the University of Toronto, along with many other colleagues I ran into by chance. I did my best to give them a rough but systematic overview of the breakthrough work I have done in statistics, and they all expressed their willingness to learn more.
记得中国物理学家张双楠教授在一个公开辩论科学问题的视频中讲过这样一件事,当他在英国求学期间完成了某个问题的研究时,有同事向他提问:“这其中的科学是什么?”他竟突然间对Science这个术语在这一问话中的内涵感到了困惑。他知道什么是物理、什么是化学,但若说一个问题中的科学是什么,他竟一时语塞。我以为,这个“科学”应该意味着发现某种他人未曾发现或即使发现了也无视或不曾有所为的某种存在。例如,“算术均数会偏离偏态分布曲线的峰顶因而对偏态总体的分布期望必然是一个有偏估计”就是一个存在、“非参数检验法会降低对差异的检验精度”也是一个存在。事实上,这两个问题早已被统计学界广泛认可,但却一直被人们忽视。对一个存在的新发现可以成为某种科学探索的起点。
I remember the Chinese physicist Professor Zhang Shuangnan recounting, in a video of a public debate on scientific issues, that when he had completed research on a certain problem while studying in the UK, a colleague asked him: “What is the science in this?” He was suddenly confounded by the connotation of the term “science” in that question. He knew what physics was and what chemistry was, but asked what the science in a problem was, he found himself at a loss for words. In my view, this “science” should mean discovering some existence that others have not discovered, or that they have ignored or done nothing about even after discovering it. For example, “the arithmetic mean deviates from the peak of a skewed distribution curve and is therefore necessarily a biased estimate of the distribution expectation of a skewed population” is such an existence, and “non-parametric tests reduce the precision of testing differences” is another. In fact, both have long been widely acknowledged by the statistical community, yet both have remained ignored. The new discovery of such an existence can become the starting point of a scientific exploration.
一切统计工作都是关于测量、分布和对分布的数学化描述,以及在此基础上发展起来的关于差异性检验和探索随机变量之间关系等的方法学体系。所以,对分布描述方法的改进将极大地影响对差异性检验和关系构建的方法学的改进,甚至可能引发众多方法上的革命。
All statistical work is about measurement, distributions, and the mathematical description of distributions, together with the methodological system built upon them for testing differences and exploring relationships among random variables. Improvements in the methods of describing distributions will therefore profoundly affect the methods for testing differences and constructing relationships, and may even trigger a revolution across numerous methods.
统计学是一种认知外部经验世界的工具,但一个有志于从事统计方法应用和研究的人不是简单地依附在这些工具上的工具人。他们必须直接接触样本采集、数据管理和统计分析,只有在大量的数据分析实践中才有可能被激发出富有创造性的灵感。
Statistics is a tool for understanding the external world of experience, but a person committed to applying and researching statistical methods is not a mere “tool person” simply attached to these tools. Such a person must be directly involved in sample collection, data management, and statistical analysis; only through extensive practice in data analysis can creative inspiration be sparked.
由于我的教育背景和统计实践经验都极其有限,我不认为我能继续做的更多,但我努力在过去的岁月里一次次超越自我。这一切都是由于在1991年3~5月期间形成的那个梦想,以及在1998年3月底的6天6夜里形成的许多突破性的思想;当然,更应归因于那个认知流程框架以及运行于其中的四维逻辑系统。从1997年11月的那天开始,每一次的思维启动,我都无法预知它将如何走下去,也不知道它将在哪里停下来。但我知道,它一定会形成一些新的思想。最终,从一个莫可名状的“非常态分析”这一星点思想的火花演变成了燎原于本书各章节的熊熊烈焰。
Given my extremely limited educational background and experience in statistical practice, I do not believe I can do much more, but over the past years I have striven to surpass myself again and again. All of this stems from the dream formed between March and May of 1991 and from the many breakthrough ideas formed during six days and six nights at the end of March 1998; above all, of course, it should be attributed to that cognitive process framework and the four-dimensional logic system running within it. From that day in November 1997 onward, each time a train of thought started, I could not foresee how it would proceed or where it would stop; but I knew it would certainly form some new ideas. In the end, the inexplicable spark of an idea called “abnormal-state analysis” evolved into a raging fire spreading across the chapters and sections of this book.
思考的过程充满着苦闷和焦虑,有时甚至会经历某种莫名的痛苦,但也能时常望见思维隧道深处的点点星星之光,而一旦某个或某几个光点闪烁出较大的光芒,就有可能将整个隧道照得通明,而思维过程也将在瞬间转变成一股无法阻遏的巨浪,并因此令人在胆颤心惊中体验到某种震撼!这正如中国宋朝诗人陆游(1125-1210)在其《游山西村》中所吟:“山重水复疑无路,柳暗花明又一村”。我想,这应该就是人类的思维能力所能展现出的一种魅力。所以,只要一个人不因循守旧,敢于挑战某种现存的事物或观念,他们就极有可能发现一些新的东西,从而有机会独享这份魅力。我深信更多的人们在穿越这扇新大门后会发现更多新奇的东西,并创造出既属于他们自己,也属于统计学,因而属于整个人类社会的更伟大的未来,因为,统计学是人类能够发现并努力建立的一套认识世界的高级方法论,它对于人类的未来不言而喻。
The process of thinking was often filled with distress and anxiety, sometimes even accompanied by an inexplicable kind of pain. Yet one could also occasionally glimpse faint starlight deep within the tunnel of thought, and once one or several of these points of light began to shine more brightly, they might illuminate the entire tunnel; in that moment the thinking process could suddenly surge into an unstoppable wave, overwhelming the thinker with awe and trembling exhilaration! This is just as the Song Dynasty poet Lu You (1125-1210) sang in his A Visit to a Village West of the Mountains: “After endless mountains and rivers that leave doubt whether there is a path out, suddenly one encounters the shade of a willow, bright flowers and a lovely village again.” (This English rendering of the two lines comes from the speech given by Hillary Clinton, then US Secretary of State, on May 22, 2010, at the reception of the US Pavilion at the Shanghai World Expo.) I believe this is a kind of enchantment that the human capacity for thought can display. Therefore, as long as people do not cling to convention and dare to challenge some existing thing or notion, they are very likely to discover something new and thereby have the chance to enjoy this enchantment for themselves. I firmly believe that more people, after passing through this new gateway, will uncover more wonders and help create a greater future that belongs not only to them but also to statistics, and thus to the whole of human society; for statistics is a high-level methodology that humanity has been able to discover and strive to build for understanding the world, and its importance to our future speaks for itself.
本书力图将作者在过去30多年里在统计学领域的探索和思考作一次总结。虽然已有过那家德国出版社约稿成书,但我希望这本书能以中英文对照的形式出版,因为中文是我的母语,它是人类历史上一种从远古传承至今的文字性语言。这是一种从创始之初即在字符的结构上富有抽象形式的高度发达的人类语言。正是由于其字符结构上丰富多样的抽象形式,汉字系统本身构成了一部具有某种自解释能力的百科全书,而由这些从远古传承至今、相对固定不变的字符系统所创造的语言和思想表达得简练、精致、优雅和深邃,这在人类文明史上恐怕无与伦比。最重要的是,我之前的全部教育背景以及关于这一话题的一切思考都唯一地得益于它。本书的英文翻译几乎全部出自基于网络的Google Translator以及作者对翻译结果的阅读和修订,仅有几个段落的翻译由ChatGPT-4o所为,因此,如果英文翻译的表达存在任何错误或语义不明之处,请以中文表达为准。
This book attempts to summarize the author's exploration and thinking in the field of statistics over more than 30 years. Although the manuscript was solicited by that German publisher, I hope the book can be published bilingually, in Chinese and English, because Chinese is my mother tongue, a written language passed down in human history from remote antiquity to the present. It is a highly developed human language whose characters have been rich in abstract structural forms from their very inception. Precisely because of the rich and diverse abstract forms of its character structure, the Chinese character system itself constitutes an encyclopedia with a certain self-explanatory capability; and the language and thought created with this relatively stable character system, transmitted from antiquity to the present, are concise, refined, elegant, and profound, perhaps unparalleled in the history of human civilization. Most importantly, my entire educational background, and all of my thinking on this topic, have benefited solely from it. The English translation of this book was produced almost entirely by the web-based Google Translator together with the author's reading and revision of its output; only a few paragraphs were translated by ChatGPT-4o. Therefore, wherever the English contains errors or semantic ambiguities, the Chinese text shall prevail.
四、各章内容简介 (Introduction to the Chapters)
第一章属于纯哲学认识论的范畴,开篇的三个认知导向隐含着所有统计学方法的三个基本类别:描述、差异性检验、相关与回归。通过对这三个导向的简单叙述,直接展示了抽象思维的工作模式。本章对辩证法认知模型的讨论应该是有所创意的。此外,还将那篇“论智慧的递进结构和认知的逻辑流程”中的内容融合在此。然而,将一个哲学范畴的议题作为本书开篇的目的则不仅是为了强调统计学的“认知方法论”这一重要属性,并以此将其与纯数学拉开一点距离,更重要的是,它是作者在过去的近30年里思考和解决自己所面对的所有统计学问题时可以依赖的最基础的方法论。
The first chapter belongs to the category of pure philosophical epistemology. The three cognitive orientations at its opening imply the three basic categories of all statistical methods: description, difference testing, and correlation and regression. A simple account of these three orientations directly demonstrates the working pattern of abstract thinking. The discussion of the dialectical cognitive model in this chapter should be somewhat original. In addition, the content of the essay “On the Progressive Structure of Intelligence and the Logical Process of Cognition” is integrated here. The purpose of opening the book with a philosophical topic, however, is not only to emphasize statistics' important attribute as a “cognitive methodology” and thereby set it somewhat apart from pure mathematics; more importantly, this is the most fundamental methodology on which the author has relied in thinking about and solving every statistical problem he has faced over the past nearly 30 years.
第二章是关于统计学的历史概貌,通过这一回顾,对正态分布、算术均数、贝叶斯法,以及现行的众多基于最优化和强制连续性假定的分段回归提出了几个关键且合理的批评,从而为本书在后续章节中提出关于自权重和加权分段回归的算法打下重要的思想基础。这里有必要一提的是,书中对贝叶斯法的批评立足于概率论本身的一个定理,而非自其诞生以来众多批评者和支持者所陷入的那些关于主观-客观、先验-后验-经验等争论不休的纯哲学式思辨。
Chapter 2 offers a brief historical overview of statistics. Through this review, several key and reasonable criticisms are made of the normal distribution, the arithmetic mean, the Bayesian method, and the many current piecewise regressions based on optimization and an enforced continuity assumption, thereby laying an important ideological foundation for the algorithms of self-weighting and weighted piecewise regression proposed in subsequent chapters. It is worth mentioning here that the book's criticism of the Bayesian method rests on a theorem of probability theory itself, rather than on the endless purely philosophical debates about subjective versus objective, or prior versus posterior versus experience, into which so many critics and supporters have fallen since the method's birth.
第三章试图重建统计学最基础的概念系统。作者认为这个概念系统非常重要,它构成了思考和解决一切统计学问题时最底层的逻辑。理解了这套概念系统,一个人才能较好地驾驭统计学这门方法论;反之,则有可能犯下错误而不自知。
Chapter 3 attempts to reconstruct the most fundamental conceptual system of statistics. The author believes this conceptual system is very important: it constitutes the lowest-level logic for thinking about and solving all statistical problems. Only by understanding it can one properly command statistics as a methodology; otherwise, one may make mistakes without realizing it.
第四章是作者的一个尝试,很简短,也很不成熟,但是希望它能在第三章和第五章之间搭建一座桥梁,以便承上启下。在作者看来,统计学针对的是测量、分布以及对测量分布的描述和分析,因此,尺度在统计学中本应是一个非常重要的概念或话题。只不过鉴于作者的学识有限,无法深入和展开。对此感兴趣的读者应该会找到阐述之道,而不感兴趣者可以忽略。
Chapter 4 is an attempt by the author. It is very short and far from mature, but I hope it can build a bridge between Chapters 3 and 5, connecting what precedes with what follows. In the author's opinion, statistics is about measurement, distribution, and the description and analysis of measured distributions, so scale ought to be a very important concept and topic in statistics. Owing to the author's limited knowledge, however, the subject could not be explored in depth. Readers who are interested should find their own way to elaborate it; those who are not may simply skip it.
第五章讨论的是纯数学领域的概率论,写作中借鉴了高世泽教授编撰的《概率统计引论》一书,这是为了在基于自权重的加权期望估计与正态分布、大数定律、中心极限定理等之间架设一座理性的桥梁。人们在这些领域的历史性探索和贡献在自加权统计量方面依然拥有极强的生命力,而且,能被轻松拓展到一切具有央化位置的分布之中,而无需一个“满足正态性分布的假定”作为它们在理论上成立的前提。
Chapter 5 discusses probability theory, a field of pure mathematics; the writing draws on the book “Introduction to Probability and Statistics” compiled by Professor Gao Shize. The aim is to build a rational bridge between self-weight-based weighted expectation estimation and the normal distribution, the law of large numbers, the central limit theorem, and so on. The historical explorations and contributions in these fields retain strong vitality with respect to self-weighted statistics, and they can easily be extended to all distributions with a centralized location, without requiring an “assumption of normality” as the premise of their theoretical validity.
从第六章开始才进入本书的关键话题,其中最重要的是详细阐述了关于连续随机变量的自权重的构建和算法,由此我们可用基于自权重的常用统计量来描述抽样分布的特征。正如本书封面的左上图所示,一个涉及服从正态分布的10万个随机模拟样本点的试验用散点图的方式显示出该自权重算法的正确性。
The key topics of this book begin only with Chapter 6, the most important of which is a detailed exposition of the construction and algorithm of the self-weight for continuous random variables; on this basis, the common self-weight-based statistics can be used to describe the characteristics of a sampling distribution. As shown in the upper-left picture on the book's cover, an experiment involving 100,000 randomly simulated sample points from a normal distribution demonstrates the correctness of the self-weight algorithm in the form of a scatter plot.
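The author's actual construction of the self-weight is given in Chapter 6 and is not reproduced in this preface. Purely to convey the flavor of the idea, here is a stand-in of my own devising, in which each observation is weighted by its estimated local density, so that the weighted mean is drawn toward the peak of the distribution: for a symmetric distribution it coincides with the arithmetic mean, while for a skewed one it moves toward the mode. This is an assumption for illustration only, not the book's algorithm:

    import numpy as np

    rng = np.random.default_rng(2)

    def density_weighted_mean(x, bins=200):
        """Weight each point by its estimated local density (an illustrative
        stand-in; the book's convex self-weight is constructed differently)."""
        dens, edges = np.histogram(x, bins=bins, density=True)
        idx = np.clip(np.digitize(x, edges) - 1, 0, bins - 1)
        w = dens[idx]                      # higher weight near the peak
        return np.sum(w * x) / np.sum(w)

    sym = rng.normal(0.0, 1.0, 100_000)     # symmetric case
    skw = rng.lognormal(0.0, 1.0, 100_000)  # skewed case

    print(f"normal   : mean {sym.mean():+.3f}, weighted {density_weighted_mean(sym):+.3f}")
    print(f"lognormal: mean {skw.mean():.3f}, weighted {density_weighted_mean(skw):.3f}")

For the symmetric sample the two estimates agree; for the skewed sample the weighted estimate moves off the arithmetic mean toward the peak.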
第七章在引入自权重和自加权均数的基础上讨论差异性检验,仅选择了最简单的t检验法、方差分析法、非参数的秩和检验等。作者提出了对t值的调整算法以规避方差齐性检验。
Chapter 7 discusses difference testing on the basis of the self-weight and the self-weighted mean, choosing only the simplest methods: the t-test, analysis of variance, the non-parametric rank-sum test, and so on. The author proposes an adjustment algorithm for the t value that circumvents the test for homogeneity of variance.
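The author's adjustment is his own and appears in Chapter 7. For orientation only, the best-known existing device with the same goal of bypassing a preliminary homogeneity-of-variance test is Welch's unequal-variance t statistic, sketched below; the author's algorithm should not be assumed to coincide with it:

    import numpy as np

    rng = np.random.default_rng(3)
    a = rng.normal(0.0, 1.0, 30)
    b = rng.normal(0.5, 3.0, 50)   # deliberately unequal variances

    # Welch's statistic uses each group's own variance, so no preliminary
    # homogeneity-of-variance test is required.
    va, vb = a.var(ddof=1) / a.size, b.var(ddof=1) / b.size
    t = (a.mean() - b.mean()) / np.sqrt(va + vb)
    df = (va + vb) ** 2 / (va**2 / (a.size - 1) + vb**2 / (b.size - 1))
    print(f"Welch t = {t:.3f}, approximate df = {df:.1f}")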
第八章讨论了简单直线回归、多项式曲线回归、多维线性回归以及对数率比回归等常用统计模型,其中,通过将离散型因变属性改为数值型连续可变属性而将对数率比回归模型的算法改为常规线性回归模型的算法应该是本书所做的一个大胆创新。讨论这些模型的目的正是为第九章的分段回归奠定基础。
Chapter 8 discusses common statistical models such as simple linear regression, polynomial curve regression, multidimensional linear regression, and logistic regression. Converting the algorithm of logistic regression into that of a conventional linear regression, by transforming the discrete dependent vattribute into a numerically continuous one, should count as a bold innovation of this book. The purpose of discussing these models is precisely to lay the foundation for the piecewise regressions of Chapter 9.
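The author's transformation is developed in Chapter 8; the classical device in the same spirit, which may or may not coincide with it, is the empirical logit for grouped binary data, after which ordinary least squares applies. A minimal sketch:

    import numpy as np

    rng = np.random.default_rng(4)
    # Grouped binary data: at each dose, m subjects and r responders.
    dose = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
    m = np.full(5, 200)
    p_true = 1.0 / (1.0 + np.exp(-(-3.0 + 1.2 * dose)))
    r = rng.binomial(m, p_true)

    # The empirical logit turns the discrete outcome into a continuous one,
    # so an ordinary linear regression can be fitted to it.
    elogit = np.log((r + 0.5) / (m - r + 0.5))
    slope, intercept = np.polyfit(dose, elogit, 1)
    print(f"OLS on empirical logits: intercept {intercept:.2f}, slope {slope:.2f} (truth: -3, 1.2)")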
第九章是关于分段回归,在引入自权重后重建了过去20多年里由本作者提出的加权临界点估计的算法,由此解决了加权分段回归分析中最关键的难题。本书封面右下方的那个黄色分布曲线极好地展示了该算法在随机模拟500个样本(共计17500个随机点)中对500个临界点的估计的收敛性和准确性。
Chapter 9 is about piecewise regression. After the self-weight is introduced, the weighted threshold estimation algorithm that the author has proposed over the past 20-odd years is rebuilt, thereby solving the most critical difficulty in weighted piecewise regression analysis. The yellow distribution curve at the lower right of the book's cover vividly demonstrates the convergence and accuracy of the 500 threshold estimates produced by the algorithm in a random simulation of 500 samples (17,500 random points in total).
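For readers new to the problem, the simplest unweighted baseline is a profile search: fit a two-piece line at each candidate threshold and keep the candidate that minimizes the residual sum of squares. The sketch below is that generic baseline, not the author's weighted-expectation algorithm:

    import numpy as np

    rng = np.random.default_rng(5)
    n = 200
    x = np.sort(rng.uniform(0, 10, n))
    true_c = 6.0
    y = 1.0 + 0.5 * x + 1.5 * np.maximum(x - true_c, 0) + rng.normal(0, 0.5, n)

    def rss_at(c):
        # Continuous two-piece line with a hinge at candidate threshold c.
        X = np.column_stack([np.ones(n), x, np.maximum(x - c, 0)])
        beta = np.linalg.lstsq(X, y, rcond=None)[0]
        return np.sum((y - X @ beta) ** 2)

    grid = np.linspace(1, 9, 401)
    c_hat = grid[np.argmin([rss_at(c) for c in grid])]
    print(f"estimated threshold: {c_hat:.2f} (truth {true_c})")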
第六~九章涉及的案例分析均采用了前后对比的篇章结构,以展示当前基于算术均数的统计算法与基于凸自加权均数的统计算法对同一案例的差异性。在为每个案例展示必要的原始数据和中间计算过程的同时,作者尽可能地制作了一些统计图,以便读者能在直观方式下体验两种算法的差异。读者应该能从作者对内容的编排中发现两者孰优孰劣,因为统计学自己就是一门用数据和图表说话的方法论。
The case analyses in Chapters 6 to 9 all adopt a before-and-after structure to show, on the same case, the differences between the current statistical algorithms based on the arithmetic mean and those based on the convex self-weighted mean. While presenting the necessary raw data and intermediate calculations for each case, the author has done his best to produce statistical charts so that readers can experience the differences between the two kinds of algorithms intuitively. Readers should be able to judge which of the two is better from the arrangement of the contents, for statistics is itself a methodology that speaks with data and charts.
本书名中用了“哲学”二字,是因为自己在过去的几十年思考过程中会经常冒出许多思想火花。我有时会感到很奇怪,为什么会有这么多新问题、新概念、新思想等在不经意间冒出来?我想这可能是个哲学问题。所以在第一章讨论了一些与认识论和逻辑等有关的哲学概念,尤其是关于人类的抽象思维和推理。从作者的个人经验看,它们在统计学的方法论构建中非常重要,因为它们不仅可以帮助我发现统计学这门学科中的问题,也可以帮助我发现自己思维过程中产生的谬误。只有找出了问题,才有可能找到解决问题的路径。因此,我将第一章视为自己在这段探索路径上最基础的方法论,它堪比一切统计方法之母。
The word “philosophy” appears in the title of this book because many sparks of thought kept emerging during my thinking over the past few decades. I sometimes found it strange: why did so many new questions, new concepts, and new ideas pop up inadvertently? I think this may itself be a philosophical question. The first chapter therefore discusses some philosophical concepts related to epistemology and logic, especially human abstract thinking and reasoning. In the author's personal experience, they are very important in the construction of statistical methodology, because they helped me discover not only problems within the discipline of statistics but also fallacies arising in my own thinking. Only by identifying a problem can one possibly find a path to its solution. I therefore regard the first chapter as the most fundamental methodology on this path of exploration, comparable to the mother of all statistical methods.
本书将“随机变量”改称“可变属性”,这是因为,统计学只讨论随机系统中的问题,这个系统中的一切要素都具有“随机性”,因此,统计学针对的“变量”可以无需特别地用“随机”这个形容词来修饰。只有在跨学科讨论时,为避免术语使用上的歧义,才需要用“随机”加以限定。此外,在英语中,“randomly variable + 一个名词”才是术语“random variable”的真实含义,后者不过是前者被简化后的一个变体。由此,我们找到了统计学真正的研究对象。这一概念的更新可被视为统计学向其研究对象的本体的回归。Bootstrap法的奠基人,当代著名统计学家Dr. Efron在回应我这个问题时说:“Random variable does mean ‘something randomly variable’”。那么,那个名词是什么呢?他没能告诉我。我思考良久,也得益于在CPDR/USUHS工作期间管理和运作一个数据库时,该数据库系统正是用了属性(Attribute)来取代传统上使用的“变量(Variable)”。可见,已经有人在用了。
This book renames the “random variable” as the “variable attribute” (hence the shorthand “vattribute”) because statistics discusses only problems within random systems, in which every element carries “randomness”; the “variables” that statistics deals with therefore need no special qualification by the adjective “random”. Only in interdisciplinary discussion is the qualifier “random” needed to avoid terminological ambiguity. Furthermore, in English the true meaning of the term “random variable” is “randomly variable + a noun”; the former is merely a simplified variant of the latter. Thus we have found the real research object of statistics, and this conceptual update can be regarded as a return of statistics to the ontology of its research object. Dr. Efron, the founder of the Bootstrap method and a famous contemporary statistician, responded to my question by saying: “Random variable does mean ‘something randomly variable’”. So what is that noun? He could not tell me. I thought about it for a long time, and I also benefited from my experience of managing and running a database at CPDR/USUHS, where the database system used “attribute” in place of the traditional “variable”. Evidently, some people were already using it.
当然,作者不会忽视随机性,也不会埋没这个术语,而是将它与统计测量的最小单元“个体”相结合组成了一个新术语——随机个体(randomid)。通过这个术语的构建,总体和样本中的随机性被保留在每一个个体之上。
Of course, the author will neither ignore randomness nor bury the term. Instead, it is combined with the “individual”, the smallest unit of statistical measurement, to form a new term: the random individual, or “randomid”. Through this construction, the randomness of the population and of the sample is retained on every individual.
作者在书中将Logistic Regression翻译为“对数率比回归”,这一翻译基于logistic模型的数学构建。此外,还将Bootstrap Method翻译为彼替法。目前中文语境下可见的翻译是“自助法”。个人认为这个翻译有点莫名其妙,词不达意。从Bootstrap法的计算流程看,它就是试图用“另一个东西代替某个需要进行统计处理的对象”。所以,彼替一词比较达意。它是从Bootstrap这个英文单词中取了B和T两个字母的发音而组成的一个统计学的中文术语。
In the book, the author translates Logistic Regression as 对数率比回归 (“log rate-ratio regression”), a translation based on the mathematical construction of the logistic model. In addition, the Bootstrap Method is translated as 彼替法, whereas the translation currently visible in Chinese contexts is 自助法 (“self-help method”). Personally, I find the latter somewhat baffling: the words fail to convey the meaning. Judging from the computational process of the bootstrap, it attempts to replace an object requiring statistical treatment with another thing, so the word 彼替 (pronounced roughly “Biti”, “that replacing this”) expresses the meaning better; it is a Chinese statistical term formed from the sounds of the letters B and T in the English word Bootstrap.
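For readers unfamiliar with the procedure being named, the computational process in question resamples the observed sample with replacement and treats each resample as a stand-in for a fresh sample, for example to estimate the standard error of the median:

    import numpy as np

    rng = np.random.default_rng(6)
    x = rng.lognormal(0.0, 1.0, 150)   # the original sample

    # Each bootstrap sample "replaces" the original with a look-alike drawn
    # from the original itself -- the substitution the name alludes to.
    B = 2000
    medians = np.array([np.median(rng.choice(x, size=x.size, replace=True))
                        for _ in range(B)])
    print(f"sample median {np.median(x):.3f}, bootstrap SE {medians.std(ddof=1):.3f}")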
作者将自己编写的关于自权重和几个加权分段回归算法的SAS程序连同几个本书所需的统计量分布表附在书后供读者拷贝和参考使用。我很可能不是一个比较优秀的SAS程序员,但在编程分析本书涉及的数据时应该是合格的,因而可以保证这些程序一定能帮助读者算出正确的结果。
The author has appended to the book the SAS programs he wrote for the self-weight and for several weighted piecewise regression algorithms, together with several distribution tables of the statistics required by the book, for readers to copy and consult. I am probably not an outstanding SAS programmer, but I should be competent at programming the analyses of the data involved in this book, and I can therefore assure readers that these programs will help them compute correct results.
当然,由于本人对文献的阅读量非常有限,无法涉猎统计学历史上所有他人对其思想和方法论的贡献。如果我在本书中所说的话语在前人的书籍和文献中已存在却未加注明引用,那么,首创那些话语的荣耀属于他们,而我不过是一个碰巧也形成了类似观点的后来者而已。此外,我不认为我在此的所有思考、观点和方法及其语言表达都正确无误。由于本人专业学识非常有限,而且思考所及大大超越了我的医学和公共卫生专业范畴,所以,一些无知、浅薄甚至错误之处在所难免,但愿这些不足和错误能够作为一种催化剂激发他人的睿智和远见,或者一面镜子照亮他们脚前的黑暗,或者作为一段阶梯助他人登上更高的山峰。
Of course, since my reading of the literature is very limited, I cannot cover all the contributions that others have made to the ideas and methodology of statistics throughout its history. If words I use in this book already exist in earlier books and literature without being cited, then the glory of originating them belongs to their authors, and I am merely a latecomer who happened to form similar views. Nor do I claim that all of my thoughts, opinions, and methods here, or their linguistic expression, are free of error. Since my professional knowledge is very limited, and my thinking reaches far beyond my own fields of medicine and public health, some ignorance, superficiality, and even mistakes are inevitable. I hope these deficiencies and errors can serve as a catalyst to spark others' wisdom and foresight, or as a mirror to light the darkness before their feet, or as a stairway to help them climb higher peaks.
中国历史上有过许多伟大的先贤,其中一位在约1400年前曾说过:“以史为镜,可知兴替。”他,就是唐朝的开国皇帝李世民。另一位则在1965年用一首富有浪漫激情却又不失睿智的诗词鼓励其国人道:“世上无难事,只要肯登攀!”他,就是近代中国的历史巨人毛泽东。
There have been many great sages in Chinese history. One of them said, about 1,400 years ago: “With history as a mirror, one may know the rise and fall of dynasties.” He was Li Shimin, the founding emperor of the Tang Dynasty. Another encouraged his countrymen in 1965 with lines of romantic passion as well as wisdom: “Nothing is difficult in the world, as long as one is willing to climb!” He was Mao Zedong, a historical giant of modern China.
五、个人权利声张 (The Assertion of Personal Rights)
在结束本序之前,作者想谈谈自己对本书中所涉新统计算法的权利。
Before closing this preface, the author would like to state his rights to the new statistical algorithms covered in this book.
由于某种长期且普遍存在的蒙昧,学术界将统计学归入数学分支学科,这直接导致了当今世界各主要国家和经济体的专利法也都存在着将统计算法等同于纯数学公式而不允许其获得专利保护的歧视性规定;再考虑到几乎所有统计软件均为商业盈利性产品的现实,在无法通过正当法律途径保障和捍卫个人恰当权益的窘境下,作者不得不以谦卑的姿态在此阐明自己对书中由作者发明、设计、构造和改进的统计算法的权利,因为它们都是具有创新性、实用性和可改进性的统计测量工具,在这一点上,它们与那些纯数学中不可更改的定理和计算公式有着质的差别。作者认为,现在已到了推动修法废除这种侵害他人正当权益、毫无合理性和公正性的歧视性法规的时候。我呼吁广大的统计人团结起来捍卫自己的恰当权益,因为所有的统计方法都浸透着每个创立者的智慧和辛劳,他们理应得到全社会的尊重和法律保护。
Owing to a long-standing and widespread ignorance, the academic community classifies statistics as a branch of mathematics. This has led directly to discriminatory provisions in the patent laws of all the major countries and economies of today's world, which equate statistical algorithms with pure mathematical formulas and deny them patent protection. Considering further the reality that almost all statistical software is a commercial, profit-making product, and facing the predicament of being unable to protect and defend his proper personal rights and interests through legitimate legal channels, the author must humbly clarify here his rights to the statistical algorithms invented, designed, constructed, and improved by him in this book, for they are all innovative, practical, and improvable statistical measurement tools; in this respect they differ qualitatively from the immutable theorems and formulas of pure mathematics. The author believes the time has come to push for legislative amendments abolishing these discriminatory provisions, which lack all reasonableness and justice and infringe upon the legitimate rights and interests of others. I call on statisticians everywhere to unite in defending their proper rights and interests, for every statistical method is steeped in the wisdom and labor of its founder, who deserves the respect and legal protection of the whole of society.
本书所涉及的全部统计算法大致可被分为三类。
All statistical algorithms covered in this book can be roughly divided into three categories.
第一类是作者引用的既有统计算法,例如,算术均数,以及基于算术均数的方差、标准差、t统计量、F统计量、相关系数和回归模型等的算法,等等;又如,基于没有实质含义和算法流程的抽象权重的加权均数的计算公式,等等,本作者不对它们声张任何个人权利。
The first category consists of existing statistical algorithms cited by the author: for example, the arithmetic mean and the arithmetic-mean-based algorithms for the variance, standard deviation, t statistic, F statistic, correlation coefficient, regression models, and so on; or, for another example, the formula for a weighted mean based on abstract weights that carry no substantive meaning or algorithmic process. The author claims no personal rights over these.
第二类是作者对现有算法的改进,例如,基于凸权均数的离峰差、标准离峰差、t统计量、F统计量、相关系数和回归模型等的算法,等等,作者对此类算法声张和保留自己的权利。
The second category consists of the author's improvements to existing algorithms: for example, the algorithms for the deviation from peak, the standard deviation from peak, the t statistic, the F statistic, correlation coefficients, and regression models, all based on the convex self-weighted mean. The author asserts and reserves his rights to the algorithms in this category.
第三类包括那些由作者独立地首创、设计和构造的统计算法,例如,关于连续随机变量的自权重的完整算法、基于凸权均数的正态化、t检验中的t值校正系数的完整算法、三分回归的完整迭代搜索流程、基于全域模型和分段模型的预测和残差的凸权均数的分段模型的回归权重的多种算法的完整流程、基于回归权重的加权期望临界点的完整算法、分段模型在加权期望临界点处的连续性检验的完整算法、分段模型的拟合优度,等等,作者对此类算法声张并保留自己的权利。
The third category comprises the statistical algorithms independently originated, designed, and constructed by the author: for example, the complete algorithm for the self-weight of continuous random variables; normalization based on the convex self-weighted mean; the complete algorithm for the t-value adjustment coefficient in the t-test; the complete iterative search process for trichotomic regression; the complete processes of the various algorithms for the regression weights of piecewise models, based on the convex self-weighted means of the predictions and of the residuals of the fullwise and piecewise models; the complete algorithm for the weighted expectation of the threshold based on the regression weights; the complete algorithm for the continuity test of piecewise models at the weighted expected threshold; the goodness-of-fit of piecewise models; and so on. The author asserts and reserves his rights to the algorithms in this category.
以上第二和第三类统计算法,非经作者同意,任何个人或法人实体等均不得将其用作商业盈利之目的,例如将它们中的任何一个编程写入商业化并进入市场销售或租赁的统计软件产品中,也不得将其编程写入非盈利性的免费统计软件产品中;否则,作者将向这些侵权者追究任何形式的侵权责任。
Without the author's consent, no individual or legal entity may use the second and third categories of algorithms above for commercial profit, for example by programming any of them into a statistical software product that is commercialized and put on the market for sale or lease; nor may they be programmed into non-profit, free statistical software products. Otherwise, the author will pursue the infringers for liability of any form.
这里强调个人权利的声张仅限于对上述各个新算法的完整流程,表明作者主动放弃对这些算法中的非完整计算流程声张个人权利。此外,任何为了统计方法学的研究和改进而引用任一这些被作者声张了权利的算法的行为都不受所声张权利的限制;但是,如果您将自己必须引用这些被作者声张了权利的算法的研究成果用于专利申请或个人权利声张,请注意您所声张权利的界限,不得因此损害本作者的相关权利。
It is emphasized here that this assertion of personal rights is limited to the complete process of each new algorithm mentioned above, which means the author voluntarily waives any assertion of personal rights over the incomplete computational processes within these algorithms. In addition, any citation of the algorithms over which the author asserts rights, made for the research and improvement of statistical methodology, is not restricted by the asserted rights. However, if you apply for a patent on, or assert personal rights over, research results that must cite these algorithms, please mind the boundaries of the rights you claim and do not thereby damage the author's related rights.
此自序写于
This preface was written
2019年3月9日 ~ 2025年11月9日
March 9, 2019 ~ November 9, 2025
马里兰州洛克维尔和印第安纳州卡梅尔家中
at the author's homes in Rockville, Maryland, and Carmel, Indiana