Statistical study of variation series and calculation of average values

Condition:

There is data on the age composition of workers (years): 18, 38, 28, 29, 26, 38, 34, 22, 28, 30, 22, 23, 35, 33, 27, 24, 30, 32, 28, 25, 29, 26, 31, 24, 29, 27, 32, 25, 29, 29.

    1. Build an interval distribution series.
    2. Build a graphic representation of the series.
    3. Graphically determine the mode and median.

Solution:

1) According to the Sturges formula, the population should be divided into 1 + 3.322 lg 30 ≈ 5.9, i.e. 6 groups.

The maximum age is 38, the minimum is 18.

Interval width: (38 − 18) / 6 ≈ 3.3. Since the ends of the intervals must be integers, we will divide the population into 5 groups instead, which gives an interval width of (38 − 18) / 5 = 4.

To facilitate the calculations, let's arrange the data in ascending order: 18, 22, 22, 23, 24, 24, 25, 25, 26, 26, 27, 27, 28, 28, 28, 29, 29, 29, 29, 29, 30, 30, 31, 32, 32, 33, 34, 35, 38, 38.

Age distribution of workers
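The grouping step can be sketched in Python. This is only an illustration: the function name `group_into_intervals` is hypothetical, and the boundary convention (each interval includes its left end, the last interval also includes its right end) is an assumption, since the original table did not survive.

```python
# Ages from the problem statement.
ages = [18, 38, 28, 29, 26, 38, 34, 22, 28, 30, 22, 23, 35, 33, 27,
        24, 30, 32, 28, 25, 29, 26, 31, 24, 29, 27, 32, 25, 29, 29]


def group_into_intervals(values, start, width, count):
    """Count values per interval [a, a + width).

    Assumed convention: the last interval also includes its right end.
    """
    freqs = [0] * count
    for v in values:
        # Index of the interval; the maximum value is clamped into the last one.
        idx = min(int((v - start) // width), count - 1)
        freqs[idx] += 1
    bounds = [(start + i * width, start + (i + 1) * width) for i in range(count)]
    return list(zip(bounds, freqs))


age_series = group_into_intervals(ages, start=18, width=4, count=5)
for (low, high), n in age_series:
    print(f"{low}-{high}: {n}")
```

With a different boundary convention the counts of the neighbouring intervals shift, so the convention should be stated together with the table.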

Graphically, a series can be displayed as a histogram or a polygon. A histogram is a bar graph: the base of each bar is the width of the interval, and its height is equal to the frequency.

A polygon (or distribution polygon) is a graph of frequencies. To build it from the histogram, we connect the midpoints of the upper sides of the rectangles. We close the polygon on the x-axis at a distance of half an interval beyond the extreme x values.

Mode (Mo) is the value of the trait under study, which occurs most frequently in a given population.

To determine the mode from the histogram, select the highest (modal) rectangle, draw a line from its upper right corner to the upper right corner of the previous rectangle, and draw a line from its upper left corner to the upper left corner of the next rectangle. From the point where these lines intersect, drop a perpendicular to the x-axis. The abscissa of that point is the mode: Mo ≈ 27.5. This means that the most common age in this population is 27-28 years.

The median (Me) is the value of the trait under study, which is in the middle of an ordered variation series.

We find the median from the cumulate, the graph of accumulated frequencies. The abscissas are the variants of the series, and the ordinates are the accumulated frequencies.

To determine the median from the cumulate, we find on the ordinate axis the point corresponding to 50% of the accumulated frequencies (in our case, 15), draw a straight line through it parallel to the x-axis, and drop a perpendicular to the x-axis from the point where this line intersects the cumulate. The abscissa of that point is the median: Me ≈ 25.9. This means that half of the workers in this population are under 26 years of age.

The set of values of the parameter studied in a given experiment or observation, ranked by magnitude (ascending or descending), is called a variation series.

Suppose we measured the blood pressure of ten patients, recording only the upper value (systolic pressure), i.e. a single number per patient.

Imagine that a series of observations (statistical population) of arterial systolic pressure in 10 observations has the following form (Table 1):

Table 1

The components of a variational series are called variants. Variants represent the numerical value of the trait being studied.

The construction of a variational series from a statistical set of observations is only the first step towards comprehending the features of the entire set. Next, it is necessary to determine the average level of the studied quantitative trait (the average level of blood protein, the average weight of patients, the average time of onset of anesthesia, etc.)

The average level is measured using indicators called averages. An average is a generalizing numerical characteristic of qualitatively homogeneous values; it characterizes the entire statistical population by a single number with respect to one attribute. It expresses what is common to the attribute across the given set of observations.

There are three types of averages in common use: mode, median and arithmetic mean.

To determine any average value, it is necessary to use the results of individual observations, writing them in the form of a variation series (Table 2).

Mode: the value that occurs most frequently in a series of observations. In our example, mode = 120. If there are no repeating values in the variation series, the series is said to have no mode. If several values are repeated the same number of times, the smallest of them is taken as the mode.

Median: the value that divides the distribution into two equal parts, i.e. the central value of a series of observations ordered in ascending or descending order. If there are 5 values in the variation series, the median is the third member of the series; if the series has an even number of members, the median is the arithmetic mean of its two central observations. For example, with 10 observations the median is the arithmetic mean of the 5th and 6th observations. In our example, both of these are 120, so the median is 120.

Note an important feature of the mode and median: their values are not affected by the numerical values of the extreme variants.
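A minimal Python sketch of these two definitions, using the ten systolic values described in this text (120 six times, 115 three times, 125 once). The helper name `mode_smallest` is hypothetical; it follows the rule stated above that the smallest value wins a tie and that a series without repeats has no mode.

```python
from collections import Counter
from statistics import median

# Ten systolic pressure readings as described in the text.
pressures = [115] * 3 + [120] * 6 + [125]


def mode_smallest(values):
    """Most frequent value; ties go to the smallest, no repeats means no mode."""
    counts = Counter(values)
    top = max(counts.values())
    if top == 1:
        return None  # no repeating values: the series has no mode
    return min(v for v, c in counts.items() if c == top)


bp_mode = mode_smallest(pressures)
bp_median = median(pressures)  # mean of the 5th and 6th ordered values

# Replacing the largest reading with an extreme value changes neither result.
altered = [115] * 3 + [120] * 6 + [200]
altered_mode = mode_smallest(altered)
altered_median = median(altered)

print(bp_mode, bp_median, altered_mode, altered_median)
```

The second half of the sketch illustrates the remark about extreme variants: only the position of the values in the ordered series matters for the median, and only the counts matter for the mode.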

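The arithmetic mean is calculated by the formula:

$\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$,

where $x_i$ is the observed value in the $i$-th observation and $n$ is the number of observations. For our case, $\bar{x} = (6 \cdot 120 + 3 \cdot 115 + 125)/10 = 119$.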

The arithmetic mean has three properties:

1. The mean occupies the middle position in the variation series. In a strictly symmetrical series, the mean coincides with the mode and the median.

2. The mean is a generalizing value: random fluctuations and differences in individual data are not visible behind it. It reflects what is typical of the entire population.

3. The sum of the deviations of all variants from the mean $\bar{x}$ is equal to zero: $\sum_{i=1}^{n} (x_i - \bar{x}) = 0$. The deviation of a variant from the mean is $x_i - \bar{x}$.
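The zero-sum property follows directly from the definition of the mean. A short derivation, writing $\bar{x}$ for the arithmetic mean of the $n$ observations $x_1, \dots, x_n$ (this notation is assumed here):

```latex
\sum_{i=1}^{n} (x_i - \bar{x})
  = \sum_{i=1}^{n} x_i - n\,\bar{x}
  = n\,\bar{x} - n\,\bar{x}
  = 0,
\qquad\text{since}\qquad
\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i .
```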

The variation series consists of variants and their corresponding frequencies. Of the ten values obtained, 120 occurred 6 times, 115 occurred 3 times, and 125 occurred once. The frequency is the absolute number of occurrences of an individual variant in the population, i.e. how many times the variant occurs in the variation series.

The variation series can be simple (all frequencies equal to 1) or grouped, i.e. condensed into groups of 3-5 variants each. A simple series is used when the number of observations is small, a grouped series when it is large.

We call the distinct values of the sample the variants of the series and denote them $x_1, x_2, \dots$. First, we rank the variants, i.e. arrange them in ascending or descending order. Each variant is assigned a weight, a number that characterizes its contribution to the total population. Frequencies or relative frequencies act as weights.

The frequency $n_i$ of a variant $x_i$ is the number of times this variant occurs in the sample under consideration.

The relative frequency $w_i$ of a variant $x_i$ is the ratio of the frequency of the variant to the sum of the frequencies of all variants. It shows what share of the units of the sample has the given variant.

A sequence of variants with their corresponding weights (frequencies or relative frequencies), written in ascending (or descending) order, is called a variation series.

Variation series are either discrete or interval.

For a discrete variation series, point values of the attribute are specified; for an interval series, the attribute values are specified as intervals. A variation series can show the distribution of frequencies or of relative frequencies, depending on which of the two is given for each variant.

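The discrete variation series of the frequency distribution has the form:

Variants $x_i$:       $x_1$   $x_2$   …   $x_m$
Frequencies $n_i$:    $n_1$   $n_2$   …   $n_m$

The relative frequencies are found by the formula $w_i = n_i / n$, $i = 1, 2, \dots, m$, where $n = n_1 + n_2 + \dots + n_m$, and they sum to one:

$w_1 + w_2 + \dots + w_m = 1$.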

Example 4.1. For a given set of numbers

4, 6, 6, 3, 4, 9, 6, 4, 6, 6

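construct the discrete variation series of the frequency distribution and of the relative-frequency distribution.

Solution. The size of the population is n = 10. The discrete series of the frequency distribution has the form

$x_i$:   3     4     6     9
$n_i$:   1     3     5     1

and the discrete series of the relative-frequency distribution, with $w_i = n_i / n$, has the form

$x_i$:   3     4     6     9
$w_i$:   0.1   0.3   0.5   0.1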
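The counting in Example 4.1 can be reproduced with a few lines of Python. This is a sketch using the standard library's `Counter`; the variable names are chosen here for illustration only.

```python
from collections import Counter

# Data of Example 4.1.
data = [4, 6, 6, 3, 4, 9, 6, 4, 6, 6]
n = len(data)

counts = Counter(data)
variants = sorted(counts)                       # ranked distinct values
frequencies = [counts[x] for x in variants]     # how often each variant occurs
relative = [counts[x] / n for x in variants]    # share of each variant in the sample

for x, f, w in zip(variants, frequencies, relative):
    print(f"x = {x}: frequency {f}, relative frequency {w}")
```

The frequencies add up to the sample size and the relative frequencies add up to one, which is a convenient check on any hand-built series.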

Interval series have a similar form of recording.

In the interval variation series of the frequency distribution, each interval is paired with its frequency $n_i$.

The sum of all frequencies is equal to the total number of observations, i.e. the size of the population: $n = n_1 + n_2 + \dots + n_m$.

In the interval variation series of the distribution of relative frequencies, each interval is paired with its relative frequency, found by the formula $w_i = n_i / n$, $i = 1, 2, \dots, m$.

The sum of all relative frequencies is equal to one: $w_1 + w_2 + \dots + w_m = 1$.

Most often in practice, interval series are used. If there are a lot of statistical sample data and their values ​​differ from each other by an arbitrarily small amount, then the discrete series for these data will be quite cumbersome and inconvenient for further research. In this case, data grouping is used, i.e. the interval containing all the values ​​of the attribute is divided into several partial intervals and, having calculated the frequency for each interval, an interval series is obtained. Let us write down in more detail the scheme for constructing an interval series, assuming that the lengths of partial intervals will be the same.

2.2 Building an interval series

To build an interval series, you need:

Determine the number of intervals;

Determine the length of the intervals;

Determine the location of the intervals on the axis.

To determine the number of intervals k, there is the Sturges formula, according to which

$k = 1 + 3.322 \lg n$,

where n is the size of the population.

For example, if there are 100 values of the attribute (variants), the formula gives $k = 1 + 3.322 \lg 100 \approx 7.6$, so about 8 intervals are recommended for constructing the interval series.

However, very often in practice the number of intervals is chosen by the researcher himself, considering that this number should not be very large, so that the series is not cumbersome, but also not very small, so as not to lose some properties of the distribution.

The interval length h is determined by the following formula:

$h = \frac{x_{\max} - x_{\min}}{k}$,

where $x_{\max}$ and $x_{\min}$ are the largest and smallest values of the variants, respectively.

The value $x_{\max} - x_{\min}$ is called the range of the series.

The intervals themselves can be constructed in different ways. One of the simplest is as follows. The value $a_1 = x_{\min} - h/2$ is taken as the beginning of the first interval. The remaining boundaries are then found by the formula $a_{i+1} = a_i + h$. Obviously, the end of the last interval $a_{m+1}$ must satisfy the condition $a_{m+1} \ge x_{\max}$.

After all the interval boundaries are found, the frequencies (or relative frequencies) of the intervals are determined. To do this, go through all the variants and count how many fall into each interval. We will consider the complete construction of an interval series using an example.

Example 4.2. For the following statistics, written in ascending order, build an interval series with the number of intervals equal to 5:

11, 12, 12, 14, 14, 15, 21, 21, 22, 23, 25, 38, 38, 39, 42, 42, 44, 45, 50, 50, 55, 56, 58, 60, 62, 63, 65, 68, 68, 68, 70, 75, 78, 78, 78, 78, 80, 80, 86, 88, 90, 91, 91, 91, 91, 91, 93, 93, 95, 96.

Solution. There are n = 50 variant values in total.

The number of intervals is specified in the problem statement, i.e. k = 5.

The length of the intervals is $h = \frac{96 - 11}{5} = 17$.

Let's define the boundaries of the intervals:

$a_1 = 11 - 8.5 = 2.5$; $a_2 = 2.5 + 17 = 19.5$; $a_3 = 19.5 + 17 = 36.5$;

$a_4 = 36.5 + 17 = 53.5$; $a_5 = 53.5 + 17 = 70.5$; $a_6 = 70.5 + 17 = 87.5$;

$a_7 = 87.5 + 17 = 104.5$.

Because the first interval starts half a step below the smallest value, six intervals are needed to cover all the data.

To determine the frequency of an interval, we count the number of variants that fall into it. For example, the variants 11, 12, 12, 14, 14, 15 fall into the first interval, from 2.5 to 19.5. There are 6 of them, so the frequency of the first interval is $n_1 = 6$ and its relative frequency is $w_1 = 6/50 = 0.12$. The variants 21, 21, 22, 23, 25, five in all, fall into the second interval, from 19.5 to 36.5. Therefore the frequency of the second interval is $n_2 = 5$ and its relative frequency is $w_2 = 5/50 = 0.1$. Finding the frequencies and relative frequencies of all intervals in the same way, we obtain the following interval series.

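The interval series of the frequency distribution has the form:

Interval:   2.5-19.5   19.5-36.5   36.5-53.5   53.5-70.5   70.5-87.5   87.5-104.5
$n_i$:      6          5           9           11          8           11

The sum of the frequencies is 6 + 5 + 9 + 11 + 8 + 11 = 50.

The interval series of the relative-frequency distribution has the form:

Interval:   2.5-19.5   19.5-36.5   36.5-53.5   53.5-70.5   70.5-87.5   87.5-104.5
$w_i$:      0.12       0.1         0.18        0.22        0.16        0.22

The sum of the relative frequencies is 0.12 + 0.1 + 0.18 + 0.22 + 0.16 + 0.22 = 1. ■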
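The construction in Example 4.2 can be checked with a short Python sketch. It follows the scheme described above (first boundary half a step below the minimum, then equal steps until the maximum is covered); treating each interval as including its left end and excluding its right end is an assumption, which does not matter here because no data value lies on a boundary.

```python
# Data of Example 4.2.
data = [11, 12, 12, 14, 14, 15, 21, 21, 22, 23, 25, 38, 38, 39, 42, 42, 44,
        45, 50, 50, 55, 56, 58, 60, 62, 63, 65, 68, 68, 68, 70, 75, 78, 78,
        78, 78, 80, 80, 86, 88, 90, 91, 91, 91, 91, 91, 93, 93, 95, 96]

k = 5                                   # number of intervals given in the problem
h = (max(data) - min(data)) / k         # interval length
bounds = [min(data) - h / 2]            # first boundary: x_min - h/2
while bounds[-1] < max(data):           # keep stepping until x_max is covered
    bounds.append(bounds[-1] + h)

intervals = list(zip(bounds, bounds[1:]))
freqs = [sum(1 for x in data if low <= x < high) for low, high in intervals]
rel_freqs = [f / len(data) for f in freqs]

for (low, high), f, w in zip(intervals, freqs, rel_freqs):
    print(f"{low}-{high}: n = {f}, w = {w}")
```

Note that the loop produces six intervals even though k = 5 was used for the length, because the grid starts below the smallest value.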

When constructing interval series, other rules can be applied, depending on the specific conditions of the problem:

1. An interval variation series may consist of partial intervals of different lengths. Unequal interval lengths make it possible to bring out the properties of a statistical population in which the attribute is unevenly distributed. For example, if the interval boundaries are numbers of inhabitants of cities, it is advisable to use intervals of unequal length: for small cities even a small difference in the number of inhabitants matters, whereas for large cities a difference of tens or hundreds of inhabitants is not significant. Interval series with unequal partial intervals are studied mainly in the general theory of statistics, and their consideration is beyond the scope of this manual.

2. In mathematical statistics, interval series are sometimes considered, for which the left boundary of the first interval is assumed to be –∞, and the right boundary of the last interval is +∞. This is done in order to bring the statistical distribution closer to the theoretical one.

3. When constructing interval series, the value of some variant may coincide exactly with an interval boundary. In this case it is best to proceed as follows. If there is only one such coincidence, assign the variant, with its frequency, to the interval that lies closer to the middle of the interval series. If there are several such variants, assign all of them either to the intervals to their right or to the intervals to their left.

4. After the number of intervals and their length have been determined, the intervals can be positioned in another way. Find the arithmetic mean $\bar{x}$ of all the values of the variants and build the first interval so that this sample mean lies at its centre, which gives the interval from $\bar{x} - 0.5h$ to $\bar{x} + 0.5h$. Then, stepping left and right by the interval length, build the remaining intervals until $x_{\min}$ and $x_{\max}$ fall into the first and last intervals, respectively.

5. Interval series with a large number of intervals are conveniently written vertically, i.e. with the intervals in the first column rather than the first row, and the frequencies (or relative frequencies) in the second column.

Sample data can be regarded as values of some random variable X. A random variable has its own distribution law. It is known from probability theory that the distribution law of a discrete random variable can be specified as a distribution series, and that of a continuous one by the density function. There is, however, a universal way of specifying the distribution law that holds for both discrete and continuous random variables: the distribution function F(x) = P(X<x). For sample data, one can specify an analogue of the distribution function, the empirical distribution function.
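As a rough illustration, an empirical distribution function can be sketched in Python as the share of sample values strictly less than x, mirroring the definition F(x) = P(X<x) used here. The function name `empirical_cdf` is hypothetical, and the sample is the one from Example 4.1.

```python
from bisect import bisect_left


def empirical_cdf(sample):
    """Return F*(x) = (number of observations strictly less than x) / n."""
    ordered = sorted(sample)
    n = len(ordered)

    def F(x):
        # bisect_left gives the count of ordered values that are < x.
        return bisect_left(ordered, x) / n

    return F


F = empirical_cdf([4, 6, 6, 3, 4, 9, 6, 4, 6, 6])
for x in (3, 4, 6, 9, 10):
    print(x, F(x))
```

Some textbooks define the distribution function with a non-strict inequality instead; in that case `bisect_right` would take the place of `bisect_left`.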

Grouping is the division of the population into groups that are homogeneous in some respect.

Purpose of the service. With the online calculator you can:

  • build a variation series, a histogram and a polygon;
  • find indicators of variation (mean, mode (including graphically), median, range of variation, quartiles, deciles, quartile coefficient of differentiation, coefficient of variation and other indicators).

Instructions. To group a series, select the type of the resulting variation series (discrete or interval) and specify the amount of data (number of rows). The resulting solution is saved in a Word file (see the example of grouping statistical data).

If the grouping has already been done and a discrete or interval variation series is given, use the online calculator Variation indicators. To test a hypothesis about the type of distribution, use the service Study of the form of distribution.

Types of statistical groupings

Variation series. In observations of a discrete random variable, the same value can occur several times. Each such value x_i of the random variable is recorded together with n_i, the number of times it appears in n observations; this is the frequency of the value.
In the case of a continuous random variable, grouping is used in practice.
  1. Typological grouping is the division of the qualitatively heterogeneous population under study into classes, socio-economic types, or homogeneous groups of units. To build this grouping, use the Discrete variational series parameter.
  2. Structural grouping is a grouping in which a homogeneous population is divided into groups that characterize its structure according to some varying attribute. To build this grouping, use the Interval series parameter.
  3. Analytical grouping is a grouping that reveals the relationship between the studied phenomena and their attributes (see analytical grouping of series).

Example #1. According to table 2, build the distribution series for 40 commercial banks of the Russian Federation. According to the obtained distribution series, determine: average profit per one commercial bank, credit investments on average per one commercial bank, modal and median value of profit; quartiles, deciles, range of variation, mean linear deviation, standard deviation, coefficient of variation.

Solution:
In the section "Type of statistical series" choose Discrete Series. Click Paste from Excel. Number of groups: according to the Sturges formula.

Principles of building statistical groupings

A series of observations arranged in ascending order is called a variation series. The grouping attribute is the attribute by which the population is divided into separate groups; it is called the basis of the grouping. Grouping can be based on both quantitative and qualitative attributes.
Once the basis of the grouping is determined, the next question is how many groups the population under study should be divided into.

When statistical data are processed on a computer, the units of an object are grouped using standard procedures.
One such procedure uses the Sturges formula to determine the optimal number of groups:

k = 1+3.322*lg(N)

where k is the number of groups and N is the number of population units.

The length of the partial intervals is calculated as h = (x_max - x_min)/k.

Then the number of observations falling into each interval is counted; these counts are taken as the frequencies n_i. Small frequencies, with values less than 5 (n_i < 5), should be merged; in that case the corresponding intervals must be merged as well.
The midpoints of the intervals x_i = (c_{i-1} + c_i)/2 are taken as the new values.
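The two formulas above can be sketched in Python as follows. The function names are hypothetical, and rounding the Sturges value to the nearest integer is an assumption (the text only shows 5.9 being taken as 6).

```python
import math


def sturges_groups(n):
    """Number of groups k = 1 + 3.322 * lg(N), rounded to the nearest integer."""
    return round(1 + 3.322 * math.log10(n))


def interval_width(x_min, x_max, k):
    """Length of the partial intervals h = (x_max - x_min) / k."""
    return (x_max - x_min) / k


k_30 = sturges_groups(30)              # age example: 30 workers
k_100 = sturges_groups(100)
h_ages = interval_width(18, 38, 5)     # ages 18..38 split into 5 groups
h_example = interval_width(11, 96, 5)  # Example 4.2 with k = 5

print(k_30, k_100, h_ages, h_example)
```

In practice the computed k is only a starting point; as noted in the text, the number of groups is often adjusted so that the interval boundaries are convenient numbers.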

Example #3. As a result of a 5% simple random sample, the following distribution of products by moisture content was obtained. Calculate: 1) the average percentage of humidity; 2) indicators characterizing the variation in humidity.

Build a variation series. Based on the found series, construct a distribution polygon, a histogram, and a cumulate. Determine the mode and median.

Example. According to the results of selective observation (sample A appendix):
a) make a series of variations;
b) calculate the relative frequencies and accumulated relative frequencies;
c) build a polygon;
d) compose an empirical distribution function;
e) plot the empirical distribution function;
f) calculate numerical characteristics: arithmetic mean, variance, standard deviation.

Based on the data given in Table 4 (Appendix 1) and corresponding to your option, perform:

  1. Based on the structural grouping, construct a variational frequency and cumulative distribution series using equal closed intervals, assuming the number of groups is 6. Present the results in a table and graphically.
  2. Analyze the variational distribution series by calculating:
    • arithmetic mean value of the feature;
    • mode, median, 1st quartile, 1st and 9th decile;
    • standard deviation;
    • the coefficient of variation.
  3. Draw conclusions.

Required: rank the series, build an interval distribution series, and calculate the mean value, the variance of the mean value, the mode and the median for the ranked and interval series.

1) Based on the initial data, construct a discrete variation series; present it in the form of a statistical table and statistical graphs. 2) Based on the initial data, construct an interval variation series with equal intervals. Choose the number of intervals yourself and explain the choice. Present the resulting variation series in the form of a statistical table and statistical graphs. Indicate the types of tables and graphs used.

In order to determine the average duration of customer service in a pension fund with a very large number of customers, 100 customers were surveyed using simple random sampling without replacement. The survey results are presented in the table. Find:
a) the boundaries within which the average service time for all clients of the pension fund lies with a probability of 0.9946;
b) the probability that the share of all fund clients with a service duration of less than 6 minutes differs from the share of such clients in the sample by no more than 10% (in absolute value);
c) the size of a sample with replacement for which it can be asserted with a probability of 0.9907 that the share of all fund clients with a service duration of less than 6 minutes differs from the share of such clients in the sample by no more than 10% (in absolute value).
2. Using the data of task 1 and Pearson's χ² test at the significance level α = 0.05, test the hypothesis that the random variable X, the customer service time, is distributed according to the normal law. Plot the histogram of the empirical distribution and the corresponding normal curve on one drawing.

Given a sample of 100 items. Required:

  1. Build a ranked variational series;
  2. Find the maximum and minimum terms of the series;
  3. Find the range of variation and the number of optimal intervals for constructing an interval series. Find the length of the interval of the interval series;
  4. Build an interval series. Find the frequencies of the sample elements falling into the constructed intervals. Find the midpoint of each interval;
  5. Construct a histogram and a polygon of frequencies. Compare with normal distribution (analytically and graphically);
  6. Plot the empirical distribution function;
  7. Calculate sample numerical characteristics: sample mean and central sample moment;
  8. Calculate approximate values ​​of standard deviation, skewness and kurtosis (using MS Excel analysis package). Compare approximate calculated values ​​with exact ones (calculated using MS Excel formulas);
  9. Compare selected graphic characteristics with the corresponding theoretical ones.

We have the following sample data (10% mechanical sample) on output and the amount of profit, in million rubles. Based on the original data:
Task 13.1.
13.1.1. Build a statistical series of distribution of enterprises by the amount of profit, forming five groups at equal intervals. Plot distribution series plots.
13.1.2. Calculate the numerical characteristics of a series of distribution of enterprises by the amount of profit: arithmetic mean, standard deviation, variance, coefficient of variation V. Draw conclusions.
Task 13.2.
13.2.1. Determine the boundaries within which the amount of profit of one enterprise in the general population lies with a probability of 0.997.
13.2.2. Using Pearson's χ² test at a significance level α, test the hypothesis that the random variable X, the amount of profit, is distributed according to the normal law.
Task 13.3.
13.3.1. Determine the coefficients of the sample regression equation.
13.3.2. Establish the presence and nature of the correlation between the cost of manufactured products (X) and the amount of profit per enterprise (Y). Plot a scatterplot and a regression line.
13.3.3. Calculate the linear correlation coefficient. Using Student's t-test, check the significance of the correlation coefficient. Draw a conclusion about the closeness of the relationship between the factors X and Y using the Chaddock scale.
Guidelines. Task 13.3 is performed using this service.

Task. The following data represent the amount of time spent by clients in concluding contracts. Build an interval variation series of the presented data and a histogram, and find an unbiased estimate of the mathematical expectation and biased and unbiased estimates of the variance.

Example. According to table 2:
1) Build distribution series for 40 commercial banks of the Russian Federation:
A) by the amount of profit;
B) by the amount of credit investments.
2) Based on the distribution series obtained, determine:
A) average profit per commercial bank;
B) credit investments on average per commercial bank;
C) modal and median value of profit; quartiles, deciles;
D) modal and median value of credit investments.
3) According to the distribution series obtained in paragraph 1, calculate:
a) range of variation;
b) average linear deviation;
c) standard deviation;
d) coefficient of variation.
Record the necessary calculations in tabular form. Analyze the results. Draw your own conclusions.
Plot the resulting distribution series. Determine the mode and median graphically.

Solution:
To build a grouping with equal intervals, we will use the service Grouping of statistical data.

Figure 1 - Entering parameters

Description of parameters
Number of lines: the amount of raw data. If the series is small, enter its size. If the sample is large, click the Paste from Excel button.
Number of groups: 0 means the number of groups will be determined by the Sturges formula.
If a specific number of groups is required, enter it (for example, 5).
Row type: Discrete series.
Significance level: for example, 0.954. This parameter is used to define the confidence interval for the mean.
Sample: for example, if a 10% mechanical sample was taken, enter the number 10. For our data, we enter 100.

The concept of a variation series. The first step in systematizing the materials of statistical observation is counting the number of units that have one or another characteristic. Having arranged the units in ascending or descending order of their quantitative attribute and counting the number of units with a specific attribute value, we obtain a variation series. The variation series characterizes the distribution of units of a certain statistical population according to some quantitative attribute.

The variation series consists of two columns, the left column contains the values ​​of the variable attribute, called variants and denoted by (x), and the right column contains absolute numbers showing how many times each variant occurs. The values ​​in this column are called frequencies and are denoted by (f).

Schematically, the variation series can be represented in the form of Table 5.1:

Table 5.1. Type of variation series

Variants (x)    Frequencies (f)

In the right column, relative indicators can also be used, characterizing the share of the frequency of an individual variant in the total sum of frequencies. These relative indicators are called relative frequencies; each is the ratio of a variant's frequency to the sum of all frequencies, f / Σf. The sum of all relative frequencies is equal to one. Relative frequencies can also be expressed as percentages, in which case their sum is equal to 100%.

Varying attributes can differ in nature. Variants of some attributes are expressed in integers, for example, the number of rooms in an apartment, the number of published books, etc. Such attributes are called discontinuous, or discrete. Variants of other attributes can take on any values within certain limits, such as the fulfillment of planned targets, wages, etc. Such attributes are called continuous.

Discrete variation series. If the variants of the variation series are expressed as discrete values, the series is called discrete; its form is shown in Table 5.2:

Table 5.2. Distribution of students by grades obtained in the exam

Grades (x)    Number of students (f)    In % of total

The nature of the distribution in discrete series is depicted graphically as a distribution polygon, Fig. 5.1.

Fig. 5.1. Distribution of students by grades obtained in the exam.

Interval variation series. For continuous features, variation series are constructed as interval series, i.e. feature values ​​in them are expressed as intervals "from and to". In this case, the minimum value of a feature in such an interval is called the lower limit of the interval, and the maximum value is called the upper limit of the interval.

Interval variation series are built not only for continuous attributes but also for discrete ones that vary over a wide range. Interval series can have equal or unequal intervals. In economic practice, unequal intervals, progressively increasing or decreasing, are used for the most part. The need for them arises especially when the attribute fluctuates unevenly and over a wide range.

Consider the form of an interval series with equal intervals, Table 5.3:

Table 5.3. Distribution of workers by output

Output, thousand rubles (x)    Number of workers (f)    Cumulative frequency (f´)

The interval distribution series is graphically depicted as a histogram, Fig.5.2.

Fig.5.2. Distribution of workers by output

Accumulated (cumulative) frequency. In practice, there is a need to convert the distribution series into cumulative series, built on the accumulated frequencies. They can be used to define structural averages that facilitate the analysis of distribution series data.

The cumulative frequencies are determined by successively adding to the frequency (or relative frequency) of the first group those of the subsequent groups of the distribution series. Cumulates and ogives are used to illustrate the distribution series. To build them, the values of a discrete attribute (or the ends of the intervals) are marked on the abscissa axis, and the running totals of the frequencies (the cumulate) are marked on the ordinate axis, Fig. 5.3.
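A minimal Python sketch of accumulating frequencies, using the interval frequencies 6, 5, 9, 11, 8, 11 from Example 4.2 as sample input. Locating the group in which the running total first reaches half of the population is added here only as an illustration of how cumulative frequencies are used for structural averages.

```python
from itertools import accumulate

# Interval frequencies from Example 4.2.
frequencies = [6, 5, 9, 11, 8, 11]

# Running totals: each entry adds the current group to all previous ones.
cumulative = list(accumulate(frequencies))

# Index of the first group whose cumulative frequency reaches half the total.
half = cumulative[-1] / 2
median_group = next(i for i, c in enumerate(cumulative) if c >= half)

print(cumulative, median_group)
```

The last cumulative frequency always equals the size of the population, which gives a quick consistency check.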

Fig. 5.3. Cumulate of the distribution of workers by output

If the scales of frequencies and variants are interchanged, i.e. the accumulated frequencies are shown on the abscissa axis and the values of the variants on the ordinate axis, then the curve characterizing the change in frequencies from group to group is called the distribution ogive, Fig. 5.4.

Fig. 5.4. Ogive of the distribution of workers by output

Variation series with equal intervals satisfy one of the most important requirements for statistical distribution series: comparability in time and space.

Distribution density. In series with unequal intervals, however, the frequencies of individual intervals are not directly comparable. In such cases, to ensure the necessary comparability, the distribution density is calculated, i.e. the number of units in each group per unit of interval width.

When constructing a graph of the distribution of a variational series with unequal intervals, the height of the rectangles is determined in proportion not to the frequencies, but to the indicators of the distribution density of the values ​​of the studied trait in the corresponding intervals.
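A minimal sketch of the density calculation follows. The interval boundaries and frequencies are hypothetical and chosen only to show how widening intervals change the picture; `distribution_density` is an illustrative helper name.

```python
def distribution_density(edges, freqs):
    """Frequency per unit of interval width for each group."""
    widths = [hi - lo for lo, hi in zip(edges, edges[1:])]
    return [f / w for f, w in zip(freqs, widths)]

# Hypothetical series with unequal, progressively widening intervals.
edges = [0, 10, 20, 40, 80]
freqs = [20, 30, 40, 40]

densities = distribution_density(edges, freqs)
print(densities)  # [2.0, 3.0, 2.0, 1.0]
```

Although the last two groups have the largest frequencies, the density is highest in the second group, so that group gets the tallest rectangle in the histogram.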

Compiling a variation series and representing it graphically is the first step in processing the initial data and in analyzing the studied population. The next step is to determine the main generalizing indicators, called the characteristics of the series. These characteristics should give an idea of the average value of the attribute across the units of the population.

Average value. The average value is a generalized characteristic of the studied trait in the studied population, reflecting its typical level per unit of the population under specific conditions of place and time.

The average value is always a named quantity: it has the same dimension as the attribute of the individual units of the population.

Before calculating the average values, it is necessary to group the units of the studied population, highlighting qualitatively homogeneous groups.

The average calculated for the population as a whole is called the general average, and the averages calculated for each group are called group averages.

There are two types of averages: power means (arithmetic, harmonic, geometric, and quadratic, i.e. root mean square) and structural averages (mode, median, quartiles, deciles).

The choice of the average for the calculation depends on the purpose.

Types of power averages and methods for their calculation. In the practice of statistical processing of the collected material, various problems arise, for the solution of which different averages are required.

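Mathematical statistics derives the various means from the power mean formula:

$$\bar{x} = \sqrt[z]{\frac{\sum x^{z}}{n}}$$

where $\bar{x}$ is the average value; x are the individual variants (feature values); n is the number of variants; z is the exponent (z = 1 gives the arithmetic mean, z = 0, taken as a limit, the geometric mean, z = -1 the harmonic mean, z = 2 the quadratic mean).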

However, the question of what type of average should be applied in each individual case is resolved by a specific analysis of the population under study.
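The dependence of the power mean on the exponent z can be illustrated with a short sketch. The data values are hypothetical, and the case z = 0 is handled as the limit of the formula (the exponential of the mean logarithm), which is an implementation choice made here.

```python
import math

def power_mean(values, z):
    """Power mean of positive values with exponent z (z = 0 gives the geometric mean)."""
    n = len(values)
    if z == 0:
        # Limit of the power mean as z -> 0: exp of the mean logarithm.
        return math.exp(sum(math.log(x) for x in values) / n)
    return (sum(x ** z for x in values) / n) ** (1 / z)

data = [2, 4, 8]  # hypothetical positive values

means = {z: power_mean(data, z) for z in (-1, 0, 1, 2)}
for z, m in means.items():
    print(f"z = {z:>2}: {m:.4f}")
```

For these values the harmonic mean is about 3.43, the geometric mean 4, the arithmetic mean about 4.67 and the quadratic mean about 5.29, so the mean grows together with the exponent.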

The most common type of average in statistics is the arithmetic mean. It is calculated when the total volume of the averaged attribute is formed as the sum of its values for the individual units of the studied population.

Depending on the nature of the initial data, the arithmetic mean is determined in various ways.

If the data are ungrouped, the calculation uses the formula of the simple arithmetic mean:

$$\bar{x} = \frac{\sum x}{n}$$

In a discrete series, the arithmetic mean is calculated by the weighted formula 3.4:

$$\bar{x} = \frac{\sum x f}{\sum f} \qquad (3.4)$$

Calculation of the arithmetic mean in the interval series. In an interval variation series, where the middle of the interval is conditionally taken as the value of a feature in each group, the arithmetic mean may differ from the mean calculated from ungrouped data. Moreover, the larger the interval in groups, the greater the possible deviations of the average calculated from the grouped data from the average calculated from the ungrouped data.

When calculating the average for an interval variation series, one first passes from the intervals to their midpoints and then calculates the average value by the formula of the weighted arithmetic mean.
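The difference between the mean of grouped and ungrouped data can be checked on the ages of the 30 workers from the worked example earlier in the document. The frequencies below assume the grouping 18-22, 22-26, 26-30, 30-34, 34-38 with the lower bound inclusive and the last interval closed on both sides; this convention is an assumption of the sketch.

```python
# Ages of 30 workers from the worked example earlier in the document.
ages = [18, 38, 28, 29, 26, 38, 34, 22, 28, 30, 22, 23, 35, 33, 27,
        24, 30, 32, 28, 25, 29, 26, 31, 24, 29, 27, 32, 25, 29, 29]

edges = [18, 22, 26, 30, 34, 38]
freqs = [1, 7, 12, 6, 4]  # frequencies under the assumed grouping convention

# Pass from the intervals to their midpoints: 20, 24, 28, 32, 36.
midpoints = [(lo + hi) / 2 for lo, hi in zip(edges, edges[1:])]

# Weighted arithmetic mean of the midpoints.
grouped_mean = sum(x * f for x, f in zip(midpoints, freqs)) / sum(freqs)

# Simple arithmetic mean of the raw data.
ungrouped_mean = sum(ages) / len(ages)

print(f"grouped: {grouped_mean:.2f}, ungrouped: {ungrouped_mean:.2f}")
```

The grouped mean (about 28.67) differs from the ungrouped one (about 28.37) because every worker in a group is treated as if their age were the midpoint of the interval.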

Properties of the arithmetic mean. The arithmetic mean has several properties that simplify calculations. Let us consider them.

1. The arithmetic mean of a constant is equal to that constant.

If x = a, then $\bar{x} = \frac{\sum a f}{\sum f} = \frac{a \sum f}{\sum f} = a$.

2. If the weights of all variants are changed proportionally, i.e. increased or decreased by the same factor, the arithmetic mean of the new series does not change.

If all weights f are reduced by a factor of k, then $\bar{x}' = \frac{\sum x \frac{f}{k}}{\sum \frac{f}{k}} = \frac{\frac{1}{k}\sum x f}{\frac{1}{k}\sum f} = \bar{x}$.

3. The sum of the positive and negative deviations of the individual variants from the mean, multiplied by the weights, is equal to zero, i.e. $\sum (x - \bar{x}) f = 0$.

Since $\bar{x} = \frac{\sum x f}{\sum f}$, we have $\sum x f = \bar{x} \sum f$. Hence $\sum (x - \bar{x}) f = \sum x f - \bar{x} \sum f = 0$.

4. If all variants are reduced or increased by some number, the arithmetic mean of the new series decreases or increases by the same number.

Reduce all variants x by a, i.e. $x' = x - a$.

Then $\bar{x}' = \frac{\sum (x - a) f}{\sum f} = \frac{\sum x f}{\sum f} - \frac{a \sum f}{\sum f} = \bar{x} - a$.

The arithmetic mean of the initial series can be obtained by adding the number a, previously subtracted from the variants, to the reduced mean, i.e. $\bar{x} = \bar{x}' + a$.

5. If all variants are reduced or increased by a factor of k, the arithmetic mean of the new series decreases or increases by the same factor k.

Let $x' = \frac{x}{k}$; then $\bar{x}' = \frac{\sum \frac{x}{k} f}{\sum f} = \frac{1}{k} \cdot \frac{\sum x f}{\sum f} = \frac{\bar{x}}{k}$.

Hence $\bar{x} = k \bar{x}'$, i.e. to obtain the mean of the original series, the arithmetic mean of the new series (with reduced variants) must be multiplied by k.
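The five properties can be verified numerically. The sketch below uses exact fractions to avoid rounding effects; the variants, weights, shift a = 28 and factor k = 4 are hypothetical values picked for illustration.

```python
from fractions import Fraction

def weighted_mean(xs, fs):
    """Weighted arithmetic mean as an exact fraction."""
    return Fraction(sum(x * f for x, f in zip(xs, fs))) / Fraction(sum(fs))

xs = [20, 24, 28, 32, 36]  # hypothetical variants
fs = [1, 7, 12, 6, 4]      # hypothetical weights
a, k = 28, 4               # hypothetical shift and scale factor

mean = weighted_mean(xs, fs)  # 86/3

checks = {
    # 1. The mean of a constant equals the constant.
    "constant": weighted_mean([7] * len(fs), fs) == 7,
    # 2. Halving every weight leaves the mean unchanged.
    "scaled weights": weighted_mean(xs, [Fraction(f, 2) for f in fs]) == mean,
    # 3. Weighted deviations from the mean sum to zero.
    "zero deviations": sum((x - mean) * f for x, f in zip(xs, fs)) == 0,
    # 4. Subtracting a from every variant lowers the mean by a.
    "shift": weighted_mean([x - a for x in xs], fs) + a == mean,
    # 5. Dividing every variant by k divides the mean by k.
    "scale": weighted_mean([Fraction(x, k) for x in xs], fs) * k == mean,
}

for name, ok in checks.items():
    print(f"{name}: {ok}")
```

Properties 4 and 5 are the ones that simplify manual calculation: the variants are first shifted and scaled to small numbers, the mean is computed, and the transformation is then undone.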

Harmonic mean. The harmonic mean is the reciprocal of the arithmetic mean of the reciprocal values of the feature. It is used when the statistical information does not contain the frequencies f of the individual variants but gives their products with the variants (M = xf). In that case the harmonic mean is calculated by formula 3.5:

$$\bar{x}_{harm} = \frac{\sum M}{\sum \frac{M}{x}} \qquad (3.5)$$

The practical application of the harmonic mean is to calculate some indices, in particular, the price index.
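As an illustration, suppose only the price x and the turnover M = xf are known for each batch of goods, while the quantities f are not recorded. The figures below are hypothetical, and `weighted_harmonic_mean` is an illustrative helper name.

```python
def weighted_harmonic_mean(xs, ms):
    """Harmonic mean with weights M = x * f: sum(M) / sum(M / x)."""
    return sum(ms) / sum(m / x for x, m in zip(xs, ms))

prices = [20, 25, 40]          # hypothetical price per unit (x)
turnover = [2000, 5000, 4000]  # hypothetical turnover (M = x * f)

mean_price = weighted_harmonic_mean(prices, turnover)
print(mean_price)  # 27.5
```

The ratios M / x recover the hidden quantities (100, 200 and 100 units), so the result coincides with the weighted arithmetic mean that would be obtained if the quantities were known.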

Geometric mean. When the geometric mean is used, the individual values of the attribute are, as a rule, chain relative values of dynamics, i.e. the ratio of each level of a time series to the previous level. The average thus characterizes the average growth rate.

The geometric mean is also used to find the value that is equidistant, in relative terms, from the maximum and minimum values of the attribute. For example, an insurance company enters into contracts for auto insurance. Depending on the specific insured event, the insurance payment may vary from 10,000 to 100,000 dollars per year. The average insurance payout is $\sqrt{10\,000 \cdot 100\,000} \approx 31\,623$ US dollars.

The geometric mean is used as the average of ratios or for distribution series that form a geometric progression; it corresponds to z = 0 in the power mean. It is convenient when attention is paid not to the absolute differences but to the ratios of two numbers.

The formulas for the calculation are as follows:

$$\bar{x}_{geom} = \sqrt[n]{\prod x}, \qquad \bar{x}_{geom} = \sqrt[\sum f]{\prod x^{f}}$$

where x are the variants of the averaged feature; $\prod$ denotes the product of the variants; f is the frequency of the variants; n is the number of variants.

The geometric mean is used in calculating average annual growth rates.
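Both uses of the geometric mean can be sketched as follows. The levels of the time series are hypothetical; the payout limits are the 10,000 and 100,000 dollars from the insurance example. `math.prod` assumes Python 3.8 or later.

```python
import math

def geometric_mean(values):
    """Simple geometric mean: n-th root of the product of n positive values."""
    return math.prod(values) ** (1 / len(values))

# Hypothetical levels of a time series over four consecutive years.
levels = [100, 120, 126, 151.2]

# Chain growth factors: each level divided by the previous one.
chain_factors = [b / a for a, b in zip(levels, levels[1:])]
avg_growth = geometric_mean(chain_factors)
print(f"average annual growth factor: {avg_growth:.4f}")

# Value equidistant (in ratio terms) from the minimum and maximum payout.
payout = geometric_mean([10_000, 100_000])
print(f"average payout: {payout:.0f}")
```

Because the chain factors multiply out to the ratio of the last level to the first, the average growth factor equals the cube root of 151.2 / 100, i.e. roughly 1.148 per year.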

Mean square. The root mean square formula is used to measure the degree of fluctuation of the individual values of a trait around the arithmetic mean in the distribution series. Thus, when calculating the indicators of variation, the average is calculated from the squares of the deviations of the individual values of the trait from the arithmetic mean.

The mean square value is calculated by the formula

$$\bar{x}_{quad} = \sqrt{\frac{\sum x^{2}}{n}} \quad \text{or, for grouped data,} \quad \bar{x}_{quad} = \sqrt{\frac{\sum x^{2} f}{\sum f}}$$

In economic research, a modified form of the root mean square is widely used to calculate indicators of variation such as the variance (dispersion) and the standard deviation.
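A minimal sketch of this modified form is given below: the squares being averaged are the squared deviations from the arithmetic mean, weighted by the frequencies. The midpoints and frequencies correspond to the age grouping assumed for the earlier worked example (intervals of width 4 from 18 to 38) and are an assumption of this sketch.

```python
import math

midpoints = [20, 24, 28, 32, 36]  # interval midpoints (assumed grouping)
freqs = [1, 7, 12, 6, 4]          # frequencies (assumed grouping)

n = sum(freqs)
mean = sum(x * f for x, f in zip(midpoints, freqs)) / n

# Variance: weighted mean of squared deviations from the arithmetic mean.
variance = sum(f * (x - mean) ** 2 for x, f in zip(midpoints, freqs)) / n

# Standard deviation: the root mean square of the deviations.
std_dev = math.sqrt(variance)

print(f"mean = {mean:.2f}, variance = {variance:.2f}, std = {std_dev:.2f}")
```

For this grouping the variance is about 17.16 and the standard deviation about 4.14 years.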

Rule of majorance. The following relationship holds between power means: the larger the exponent z, the greater the value of the mean (Table 5.4).

Table 5.4

Relationship between averages

z value | -1 | 0 | 1 | 2
The ratio between the averages | $\bar{x}_{harm} \le \bar{x}_{geom} \le \bar{x}_{arith} \le \bar{x}_{quad}$

This relation is called the rule of majorance.

Structural averages. To characterize the structure of the population, special indicators are used, which can be called structural averages. These measures include mode, median, quartiles, and deciles.

Mode. The mode (Mo) is the most frequently occurring value of a feature among the units of the population. It is the value of the feature that corresponds to the maximum point of the theoretical distribution curve.

The mode is widely used in commercial practice in the study of consumer demand (for example, when determining the sizes of clothes and shoes that are in greatest demand) and in price registration. A population can have several modes.

Mode calculation in a discrete series. In a discrete series, the mode is the variant with the highest frequency.

Calculation of the mode in an interval series. In an interval variation series, the central variant of the modal interval, i.e. the interval with the highest frequency (or relative frequency), is approximately taken as the mode. A more precise value of the attribute has to be found within the modal interval. For an interval series, the mode is determined by the formula

$$Mo = x_{Mo} + h_{Mo} \frac{f_{Mo} - f_{Mo-1}}{(f_{Mo} - f_{Mo-1}) + (f_{Mo} - f_{Mo+1})}$$

where $x_{Mo}$ is the lower limit of the modal interval; $h_{Mo}$ is the value (width) of the modal interval; $f_{Mo}$ is the frequency of the modal interval; $f_{Mo-1}$ is the frequency of the interval preceding the modal one; $f_{Mo+1}$ is the frequency of the interval following the modal one.
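The interval formula for the mode can be sketched as follows. The frequencies are those of the age grouping assumed for the earlier worked example (lower bound inclusive, last interval closed), which is an assumption of this sketch; a missing neighbouring interval is treated as having zero frequency.

```python
def interval_mode(edges, freqs):
    """Mode of an interval series by the standard interpolation formula."""
    # Index of the modal interval (the first one if several share the maximum).
    m = max(range(len(freqs)), key=freqs.__getitem__)
    f_m = freqs[m]
    f_prev = freqs[m - 1] if m > 0 else 0
    f_next = freqs[m + 1] if m < len(freqs) - 1 else 0
    width = edges[m + 1] - edges[m]
    # Note: the denominator is zero only if both neighbours equal f_m.
    return edges[m] + width * (f_m - f_prev) / ((f_m - f_prev) + (f_m - f_next))

edges = [18, 22, 26, 30, 34, 38]
freqs = [1, 7, 12, 6, 4]  # frequencies under the assumed grouping

mode = interval_mode(edges, freqs)
print(f"Mo = {mode:.2f}")
```

With these frequencies the modal interval is 26-30 and the formula gives Mo = 26 + 4 · 5 / 11, about 27.8, which is in the same region as the graphical estimate Mo ≈ 27.5 obtained earlier from the histogram.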

Median. The median (Me) is the value of the feature at the middle unit of the ranked series. A ranked series is a series in which the values of the feature are written in ascending or descending order. In other words, the median is the value that divides an ordered variation series into two equal parts: in one part the values of the feature are less than the median variant, and in the other they are greater.

To find the median, its ordinal number is determined first. With an odd number of units, one is added to the sum of all frequencies and the result is divided by two. With an even number of units, the ordinal number is the sum of all frequencies divided by two. Knowing the ordinal number of the median, it is easy to find its value from the accumulated frequencies.

Calculation of the median in a discrete series. A sample survey produced data on the distribution of families by the number of children (Table 5.5). To determine the median, first determine its ordinal number:

$N_{Me} = 50$

Then we build the series of accumulated frequencies (S) and find the median from its ordinal number and the accumulated frequencies. The accumulated frequency 33 shows that in 33 families the number of children does not exceed 1, but since the ordinal number of the median is 50, the median falls in the next group, which covers the families numbered 34 to 55.
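The procedure can be sketched as below. The table values are not available here, so the frequencies are hypothetical: they were chosen only so that the accumulated frequencies pass through 33 and 55 and the ordinal number of the median is 50, as in the text. The ordinal number follows the rule stated above.

```python
from itertools import accumulate

def median_position(n):
    """Ordinal number of the median: (n + 1) / 2 for odd n, n / 2 for even n."""
    return (n + 1) // 2 if n % 2 else n // 2

def discrete_median(values, freqs):
    """First variant whose accumulated frequency reaches the median position."""
    position = median_position(sum(freqs))
    for value, cum in zip(values, accumulate(freqs)):
        if cum >= position:
            return value

children = [0, 1, 2, 3, 4, 5]        # number of children in the family
families = [10, 23, 22, 20, 15, 10]  # hypothetical number of families

position = median_position(sum(families))
median_children = discrete_median(children, families)

print(list(accumulate(families)))  # [10, 33, 55, 75, 90, 100]
print(position, median_children)   # 50 2
```

Under these assumed frequencies the 50th family lies in the group with accumulated frequency 55, so the median is 2 children.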

Table 5.5

Distribution of families by the number of children

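Number of children in the family | Number of families

Calculation of the median in an interval series. In an interval series, the median is found within the median interval, i.e. the first interval whose accumulated frequency reaches the ordinal number of the median, by the formula

$$Me = x_{Me} + h_{Me} \frac{\frac{1}{2}\sum f - S_{Me-1}}{f_{Me}}$$

where $x_{Me}$ is the lower limit of the median interval; $h_{Me}$ is the value (width) of the median interval; $S_{Me-1}$ is the cumulative frequency of the interval preceding the median interval; $f_{Me}$ is the frequency of the median interval.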

All the forms of the power mean considered above share an important property (in contrast to structural averages): the formula for the mean includes all values of the series, i.e. the size of the average is influenced by the value of every variant.

On the one hand, this is a very positive property, since the effect of all causes affecting all units of the population under study is taken into account. On the other hand, even a single observation that got into the initial data by accident can significantly distort the idea of the level of the studied trait in the population under consideration (especially in short series).

Quartiles and deciles. By analogy with finding the median in a variation series, one can find the value of the feature at any unit of the ranked series. In particular, one can find the value of the feature at the units that divide the series into 4 equal parts, into 10 equal parts, and so on.

Quartiles. Variants that divide the ranked series into four equal parts are called quartiles.

The following are distinguished: the lower (or first) quartile (Q1) is the value of the feature at the unit of the ranked series that divides the population in the ratio ¼ to ¾, and the upper (or third) quartile (Q3) is the value of the feature at the unit that divides the population in the ratio ¾ to ¼.

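The second quartile is the median: Q2 = Me. The lower and upper quartiles in an interval series are calculated by formulas similar to that of the median:

$$Q_1 = x_{Q_1} + h \frac{\frac{1}{4}\sum f - S_{Q_1-1}}{f_{Q_1}}, \qquad Q_3 = x_{Q_3} + h \frac{\frac{3}{4}\sum f - S_{Q_3-1}}{f_{Q_3}}$$

where $x_{Q_1}$, $x_{Q_3}$ are the lower limits of the intervals containing the lower and upper quartiles, respectively; h is the width of the interval;

$S_{Q_1-1}$, $S_{Q_3-1}$ are the cumulative frequencies of the intervals preceding the interval containing the lower or upper quartile;

$f_{Q_1}$, $f_{Q_3}$ are the frequencies of the quartile intervals (lower and upper).

The intervals containing Q1 and Q3 are determined from the accumulated frequencies (or relative frequencies).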

Deciles. In addition to quartiles, deciles are calculated: variants that divide the ranked series into 10 equal parts.

They are denoted by D. The first decile D1 divides the series in the ratio 1/10 to 9/10, the second decile D2 in the ratio 2/10 to 8/10, and so on. They are calculated in the same way as the median and the quartiles.
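Since the median, quartiles and deciles of an interval series are all found by the same interpolation, they can be sketched with one function. The edges and frequencies are those of the age grouping assumed for the earlier worked example; the function name `interval_quantile` and its `(k, parts)` parametrization are illustrative choices.

```python
from itertools import accumulate

def interval_quantile(edges, freqs, k, parts):
    """Value dividing the series in the ratio k : (parts - k), 0 < k <= parts.

    Uses linear interpolation inside the interval whose accumulated
    frequency first reaches k / parts of the total.
    """
    target = k * sum(freqs) / parts
    cum_prev = 0
    for i, cum in enumerate(accumulate(freqs)):
        if cum >= target:
            width = edges[i + 1] - edges[i]
            return edges[i] + width * (target - cum_prev) / freqs[i]
        cum_prev = cum

edges = [18, 22, 26, 30, 34, 38]
freqs = [1, 7, 12, 6, 4]  # frequencies under the assumed grouping

q1 = interval_quantile(edges, freqs, 1, 4)
me = interval_quantile(edges, freqs, 2, 4)  # Q2 = Me
q3 = interval_quantile(edges, freqs, 3, 4)
d1 = interval_quantile(edges, freqs, 1, 10)
d9 = interval_quantile(edges, freqs, 9, 10)

print(f"Q1 = {q1:.2f}, Me = {me:.2f}, Q3 = {q3:.2f}")
print(f"D1 = {d1:.2f}, D9 = {d9:.2f}")
```

For this grouping the sketch gives Q1 ≈ 25.71, Me ≈ 28.33, Q3 ≈ 31.67, D1 ≈ 23.14 and D9 = 35, so half of the workers are between roughly 25.7 and 31.7 years old.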

The median, quartiles, and deciles all belong to the so-called order statistics, i.e. variants that occupy a certain ordinal place in a ranked series.