Ligong Chen's Definition of Piecewise Regression and the Basic Conceptual System of Statistics
(2010-10-30 21:06:56)
Link to the paper on trichotomic regression analysis by the author (陈立功, Ligong Chen):
http://www.meetingproceedings.us/2009/jsm/contents/papers/303243.pdf
I. What is piecewise regression?
In statistics, piecewise regression analysis (PRA), or simply piecewise regression, is a method or analytical strategy within general regression analysis. It seeks one or more random critical points (thresholds) on a segmented random variable in order to divide a continuably measured random sample space into two or more sub-spaces, and then fits a threshold model to each sub-space, so that a set of randomly variable regression models describes and predicts the complex regression relationships over the whole measurable space. With PRA, we may be able to follow or change those relationships in order to achieve a particular purpose. A complete strategy for general regression analysis should therefore consist of a fullwise regression analysis and a piecewise regression analysis [1]. Given this definition, PRA should sit at the top of the body of statistical methodology [2].
The regression models fitted in a piecewise regression analysis are sometimes called piecewise models, threshold models, or segmented models. The three terms share the same connotation; that is, they are synonyms.
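As a rough illustration of the definition above, the following Python sketch searches for a single threshold on one predictor and fits a separate least-squares line on each side of it. It is a generic grid search chosen for illustration, not the procedure of the paper linked above; the function names `fit_line` and `piecewise_fit`, the `min_points` parameter, and the toy data are all assumptions.

```python
from statistics import fmean


def fit_line(xs, ys):
    """Ordinary least squares for y = a + b*x; returns (a, b, sse)."""
    mx, my = fmean(xs), fmean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    b = sxy / sxx if sxx else 0.0
    a = my - b * mx
    sse = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))
    return a, b, sse


def piecewise_fit(xs, ys, min_points=3):
    """Hypothetical helper: try every split of the x-sorted sample and keep
    the one whose two fitted lines give the smallest total squared error."""
    pairs = sorted(zip(xs, ys))
    best = None
    for i in range(min_points, len(pairs) - min_points + 1):
        left, right = pairs[:i], pairs[i:]
        if left[-1][0] == right[0][0]:
            continue  # a threshold cannot separate tied x values
        a1, b1, sse1 = fit_line(*zip(*left))
        a2, b2, sse2 = fit_line(*zip(*right))
        if best is None or sse1 + sse2 < best["sse"]:
            best = {
                "threshold": (left[-1][0] + right[0][0]) / 2,
                "left": (a1, b1),
                "right": (a2, b2),
                "sse": sse1 + sse2,
            }
    return best


# Toy data (an assumption): slope +2 below x = 10, slope -1 from x = 10 on.
xs = list(range(20))
ys = [2 * x if x < 10 else 30 - x for x in xs]

fullwise = fit_line(xs, ys)        # one model over the whole sample space
piecewise = piecewise_fit(xs, ys)  # two threshold models, one per sub-space
print("fullwise SSE:", fullwise[2])
print("piecewise fit:", piecewise)
```

On this toy sample the single fullwise line leaves a large residual, while the two-segment fit recovers both slopes. Real data would call for a criterion that also guards against overfitting when more thresholds are allowed.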
II. The Basic Conceptual System of Statistics
To understand the trichotomic regression analysis proposed in Ligong Chen's paper accurately and in full, several new concepts need to be introduced into the current basic conceptual system of statistics and probability theory, and several existing basic concepts need to be clarified or even redefined. Because of the limited space in the JSM proceedings, he had no chance to do so there. Since the connotations of some existing concepts have been adjusted and some new concepts have emerged, anyone trying to understand the ideas and the method in his paper may find it very difficult without the concepts stated here. So I would like to use this space to give his explanation.
Individual: in the domain of epistemology, an individual is an independent existence, substance, entity, or object with all of its known, knowable, and unknowable attributes, by which it can be distinguished from all other individuals. Within a specific scope, anything that exists as the smallest unit can be called an individual. When an individual enters a subject's observation and can be cognized or recognized, each of its attributes should be certain rather than uncertain. In other words, an individual is itself rather than anything else because all of its attributes are certain, at least at the moment of cognition or recognition. Conversely, if its attributes are uncertain when it is observed, the subject has no way to know it; that is, it is immeasurable to the subject.
Attribute: an attribute of an individual, denoted by A (in Kunstler Script), is an abstraction of one of its characteristics. Such characteristics are generally either qualitative or quantitative, and through them we may define at least one group or category among the individuals. For example, an individual may have a name, gender, age, height, and weight. Every attribute is unique and carries a specific meaning.
Sub-attribute: a sub-attribute is a subordinate attribute defined under the name of an attribute, for example Name = {Aristotle, Bacon, Hegel}, Gender = {male, female, abnormity}, or Age = {a value in the range [0, 140], e.g. 2, 35, or 86 years old}, where {Aristotle, Bacon, Hegel}, {male, female, abnormity}, and {2, 35, or 86 years old} are sub-attributes defined under the names Name, Gender, and Age, respectively.
Invariable attribute: an attribute is said to be invariable if (1) it is itself, or (2) no sub-attributes can be defined under its name, or (3) sub-attributes exist but it is unnecessary to define them. Such an attribute can be regarded as having no change or variability during an observation or experiment, so it can be used to define a group or category clearly, for example Gender = male, or Age >= 18, or Gender = male and Age >= 18.
Variable attribute: an attribute is said to be variable if at least two different sub-attributes can be defined under its name in an observation or experiment, and the sub-attributes can be distinguished from one another and defined clearly, without any confusion or conflict among them. The concept of a variable attribute is thus equivalent to the concept of a random variable in the current system, for example Gender = {male, female, abnormity} or 0 <= Age <= 140.
Discretely variable attribute (DVA): an attribute is said to be discretely variable if all the sub-attributes defined under its name are qualitative, for example locations and schools, trees and lakes, or diseases and treatments.
Continuously variable attribute (CVA): an attribute is said to be continuously variable if all the sub-attributes defined under its name are quantitative, for example height and weight, speed and acceleration, or volume and ratio.
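The attribute definitions above can be read as a small classification rule. The sketch below is one possible reading, not part of the original system: it assumes that text values stand for qualitative sub-attributes and numeric values for quantitative ones, and the function name `classify_attribute` and the three records are hypothetical.

```python
from numbers import Real


def classify_attribute(observed):
    """Classify an attribute from the sub-attributes observed under its name.

    Assumption: strings are qualitative, real numbers are quantitative.
    """
    subs = set(observed)
    if len(subs) < 2:
        return "invariable"  # fewer than two distinct sub-attributes
    if all(isinstance(s, Real) and not isinstance(s, bool) for s in subs):
        return "continuously variable"
    if all(isinstance(s, str) for s in subs):
        return "discretely variable"
    return "mixed"  # a case the definitions above do not cover


# Hypothetical records: one dict per individual, one key per attribute.
records = [
    {"Species": "human", "Gender": "male", "Age": 35},
    {"Species": "human", "Gender": "female", "Age": 2},
    {"Species": "human", "Gender": "male", "Age": 86},
]

attributes = {
    name: classify_attribute(r[name] for r in records) for name in records[0]
}
print(attributes)
```

Here Species takes a single value in the observation, so it behaves as an invariable attribute that defines the group, while Gender and Age vary across individuals.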
Population or population space: a population, denoted by P (in Kunstler Script), is a group or set of individuals that share the same invariable and variable attributes. The individuals in a population constitute a space, the population space. A population is usually considered infinite, because the number of its individuals may be infinite, or so large that they cannot all be observed in one finite observation. A population may enter the scope of a particular observation or experiment conducted by a subject or a group of subjects.
Scale space: a scale space, denoted by Ω, is a space constructed from all the sub-attributes of a variable attribute, or all the possible outcomes of an observation or experiment, without duplicates or conflicts; a questionnaire for a statistical survey is an example. A scale space is thus a set of invariable and variable attributes, and this set must not be empty, because it is a tool for statistical measurement. The scale space defined here therefore corresponds to what current probability theory calls the "sample space". Clearly, however, a scale space should not be called a sample space, since it is merely a measurement tool rather than the sample itself.
Measure: a measure, denoted by M, is a measuring action with a specific purpose within the scope of an observation or experiment, usually carried out by at least one subject in order to obtain original records or cognitions of the invariable and variable attributes of a certain number of individuals from the population. In statistics in particular, every measure is a random measure, since any object that is measured is obtained at random.
Distribution: a distribution, denoted by D, is the result of a measuring action expressed on a scale space, that is, the observed results for the individuals laid out over the scale space.
Sample: a sample, denoted by S, is the complete result for all individuals observed in a measuring action; it is therefore a complete distribution over a scale space. A sample is a random subset of a population. There is no independent sample without a scale space associated with it, and vice versa. In the domain of statistics, a sample is often called a dataset. Because the individuals in a population are usually infinite, a sample should be obtained through a random mechanism so that its representativeness of the population is guaranteed to some degree. Thus, any sample in the domain of statistics is a random sample. An individual in a sample is often called an observation, a random sample point, or simply a sample point. An individual, observation, or sample point in a sample therefore cannot itself be called a "sample"; otherwise it would cause confusion or even conflict with the sample itself, except when only one individual is measured, in which case the sample equals that individual. In general, a sample as a whole is an individual in another scope of observation, but one that differs from the individuals inside the sample. As such an "individual", a sample should have its own attributes, the sample attributes, and each of them should be certain too, just as with the attributes of any individual in the population discussed above.
Sample space: a sample space, which shares the symbol S with the sample, can be the sample itself or the sample dataset, because a sample should contain no duplicate records of individuals. Each sample point is therefore an independent element, even when the sample has only one discrete variable with two or more categories and three or more observations. Put the other way round: if a sample itself cannot be called a sample space, what else could be? In fact, all the individuals in a sample constitute a complete space, and that space is the sample space.
Measurable space: a space is said to be measurable if every individual in it can be measured on a scale space. A population is thus a measurable space, since all the individuals in it should be measurable on a scale space.
Measured space: a space is said to be measured if every individual in it has been measured on a scale space, regardless of whether the measurement of any given individual succeeded. A sample is thus a measured space.
Random mapping: a random mapping, denoted by M (in Kunstler Script), is a random mechanism by which a sample or sample space is obtained from a measurable space or population through a scale space.
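The relationship between a measurable space, a scale space, and a measured space can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the population is a finite stand-in for what the text treats as infinite, the scale is a closed height interval, and the name `random_mapping` and its `seed` parameter are hypothetical.

```python
import random


def random_mapping(population, scale, n, seed=0):
    """Draw n individuals at random and record each one on the scale.

    A value outside the scale is recorded as None: the individual was
    measured, but the measurement did not succeed.
    """
    rng = random.Random(seed)
    drawn = rng.sample(population, n)  # sampling without replacement
    low, high = scale
    return [(ident, h if low <= h <= high else None) for ident, h in drawn]


# Finite stand-in population: (identifier, height in cm), heights 150..199.
population = [(i, 150 + (i * 7) % 50) for i in range(1000)]
scale = (150, 190)  # the scale space used for this measurement

sample = random_mapping(population, scale, n=20)
print(sample)
```

Every drawn individual appears in the result whether or not its height fits the scale, which mirrors the statement that a sample is a measured space regardless of the success of each measurement.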
Probability space: a probability space, denoted by P, is a sample space that has been probabilized to 1. We cannot define a probability space over a population space or a measurable space, because a scale space may not be complete for a population but is complete for a sample. In addition, a population space is usually unknown, so defining a probability space over it would give us an unknown space, and such a definition would be in vain. Nor can we define a probability space over a scale space alone, since the scale space is merely a measurement tool rather than the real random world that we try to understand through probability. Rather, a probability space should be defined over a scale space on which all the measured individuals of a sample space are distributed, since the sample space is a distribution over the scale space. Thus, only the sample space is a complete space that can be probabilized over the scale space. Of course, a definite complete space that is well defined in mathematics may also be probabilized to 1 as long as it satisfies certain conditions set by the existing knowledge system; examples are the theoretical distributions, such as the normal distribution, the standard normal distribution, the t-distribution, the F-distribution, and the chi-square distribution. How to probabilize a sample space therefore belongs to the domain of mathematics, especially probability theory.
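One plausible reading of "probabilizing" a sample space over a scale space is to tally the sample on the scale and divide by the number of measured individuals, so that the shares sum to 1. The sketch below assumes that reading; the names `distribution` and `probabilize` and the five-observation sample are hypothetical.

```python
from collections import Counter
from fractions import Fraction

scale_space = ("male", "female", "abnormity")  # the measurement tool
sample = ["male", "female", "female", "male", "female"]  # measured individuals


def distribution(sample, scale_space):
    """Lay the sample out over the scale space as counts per sub-attribute."""
    outside = set(sample) - set(scale_space)
    if outside:
        raise ValueError(f"values not on the scale: {sorted(outside)}")
    counts = Counter(sample)
    return {level: counts[level] for level in scale_space}


def probabilize(dist):
    """Turn counts into exact relative frequencies that sum to 1."""
    n = sum(dist.values())
    return {level: Fraction(count, n) for level, count in dist.items()}


dist = distribution(sample, scale_space)
probs = probabilize(dist)
print(dist)
print(probs)
```

The scale space alone carries no probabilities; they appear only once a sample has been distributed over it, which is the point the definition makes.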
Continuity and continuability of space: because a population is infinite, we cannot discuss continuity directly on a population space, but we can discuss it through a sample. Two different concepts are involved: continuous space and continuable space, and the two are not the same. A sample space is said to be continuous if all its individuals lie in one definite sub-sample space or in the whole sample space itself. For example, the recorded heights of 100 males and the recorded heights of 100 females are each regarded as a continuous space rather than a continuable space. If we put them together, however, the recorded heights of the 200 people are regarded as a continuable space rather than a continuous space, because the mixed space is made up of two identifiable spaces that may overlap or be separate. Nevertheless, the mixed space can still be measured in a continuous manner as a single whole space, with "human height" as its attribute.
Indivisibility and divisibility of space: the divisibility of a space is easy to understand when the space is discrete. What used to be philosophically difficult was the divisibility of a continuous space: a brick, for example, is a complete continuous space, and dividing it means breaking it. Once the concept of continuable space is introduced, this difficulty disappears. A space formed by gluing two bricks together can be regarded as a continuable space, yet it is separable. Thus a continuable space is not the same as a continuous space, and a continuable space is divisible.
Statistic: a statistic, denoted by s, is an attribute of a sample or sample space. Because the sample is a random subset of a population, a statistic is a random point measure. It is also regarded as a real measurable function defined over a sample space, and thus over a probability space. What statistics does is construct specific statistics to describe the attributes of a sample space and thereby infer the corresponding attributes of the population space (the attributes of a population are usually called by a specific term, parameters). A statistic is therefore a random constant rather than a constant in the ordinary mathematical sense; a mathematical constant usually carries no qualifying adjective, that is, it is simply itself. It follows that all the records in a sample can be understood as random constants as well. In the domain of statistics, a constant is said to be random only with respect to the sample. We can therefore say that a statistic is certain for the given sample but uncertain for the population. However, a sample statistic may differ among sub-samples, and between a sub-sample and the whole sample, because any sub-sample provides less information for its own statistic. For example, a single fullwise regression model provides one definite, invariable regression relationship over the whole sample space, whereas a piecewise regression model gives a set of different regression relationships in different threshold intervals. The whole sample space can thus be segmented into several pieces.
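A small numerical sketch may help with the claim that a statistic is fixed for a given sample yet can differ between sub-samples and the whole sample. The six heights below are made-up values used only for illustration, and the mean is chosen as the statistic purely for simplicity.

```python
from statistics import fmean

# Made-up heights in cm; the two lists are sub-samples of one sample.
males = [170, 175, 180]
females = [158, 162, 166]
whole = males + females

stat_males = fmean(males)      # statistic of the first sub-sample
stat_females = fmean(females)  # statistic of the second sub-sample
stat_whole = fmean(whole)      # statistic of the whole sample

print(stat_males, stat_females, stat_whole)
```

Each value is a definite number once the sample is given, but the three numbers differ, in the same way that threshold models fitted on sub-spaces can differ from a single fullwise model.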
Parameter: a parameter, denoted by p, is an attribute of a population and is usually estimated and inferred with a corresponding sample statistic. In a statistical estimate it can be treated as invariable, since that assumption does no harm with respect to the population. However, we must recognize that it should be variable over its own natural history.
Random space or random system: within the scope discussed here, a random space or random system, denoted by R (in Kunstler Script), is an abstract concept associated with all the concepts above. It is a generalized concept rather than any one or several of the concrete concepts stated above; in other words, all of those concepts together constitute a complete random space. A random space may contain a degree of certainty, owing to the invariable attributes that define the population, the random constants of the individuals in the sample, and all the statistics of the sample itself, so our descriptions of and inferences about the population also carry a degree of certainty. However, we must remember that the uncertainty of any sample with respect to its population is absolute, so all descriptions of a population based on a sample are essentially random or uncertain.
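(Note: this conceptual system was put forward on 18 October 2010 in the Wikipedia entry on Piecewise regression analysis. Because it was suspected of being original research and might provoke a major academic controversy, Wikipedia administrators deleted the entire entry on the 27th of that month.)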