





























This document covers the basics of statistics, data, types of data, etc.






























Statistics is the study of the collection, analysis, interpretation, presentation, and organization of data. Statistics is a branch of mathematics that transforms data into useful information for decision makers. Statistics is the science of data, and it follows the scientific method:
(i) Formulate a theory;
(ii) Collect data to test the theory;
(iii) Analyze the results;
(iv) Interpret the results, and make decisions.
A.L. Bowley has defined statistics as: (i) Statistics is the science of counting, (ii) Statistics may rightly be called the science of averages, and (iii) Statistics is the science of measurement of social organism regarded as a whole in all its manifestations.
Boddington defined it as: Statistics is the science of estimates and probabilities.
Agresti & Finlay (1997): Statistics consists of a body of methods for collecting and analyzing data.
Statistics is much more than just the tabulation of numbers and the graphical presentation of these tabulated numbers. Statistics is the science of gaining information from numerical and categorical data.
Statistical methods can be used to find answers to the questions like:
That is, statistics provides methods for:
Example Statistics in practice: Consider the following problems:
Population and sample are two basic concepts of statistics.
A population can be characterized as the set of individual persons or objects in which an investigator is primarily interested in the course of a research problem. Sometimes the desired measurements are obtained for all individuals in the population, but often only a subset of individuals of that population is observed; such a subset constitutes a sample.
Population is the collection of all individuals or items under consideration in a statistical study. (Weiss, 1999)
Sample is that part of the population from which information is collected. (Weiss, 1999)
Example Finite population: In many cases the population under consideration is one which could be physically listed. For example: – The students of the University of Tampere,
to make more thorough analysis of the subject under investigation. Furthermore, the preliminary descriptive analysis of a sample often reveals features that lead to the choice of the appropriate inferential method to be later used. Sometimes it is possible to collect the data from the whole population. In that case it is possible to perform a descriptive study on the population as well as usually on the sample. Only when an inference is made about the population based on information obtained from the sample does the study become inferential.
The major characteristics of statistics:
(i) Statistics are aggregates of facts. It means a single figure is not statistics. For example, the national income of a country for a single year is not statistics, but the same for two or more years is statistics.
(ii) Statistics are affected by a number of factors. For example, the sale of a product depends on a number of factors such as its price, quality, competition, the income of the consumers, and so on.
(iii) Statistics must be reasonably accurate. Wrong figures, if analysed, will lead to erroneous conclusions. Hence, it is necessary that conclusions be based on accurate figures.
(iv) Statistics must be collected in a systematic manner. If data are collected in a haphazard manner, they will not be reliable and will lead to misleading conclusions.
(v) Statistics must be collected for a pre-determined purpose.
(vi) Lastly, statistics should be placed in relation to each other. If one collects data unrelated to each other, then such data will be confusing and will not lead to any logical conclusions. Data should be comparable over time and over space.
Apart from the methods comprising the scope of the descriptive and inferential branches of statistics, statistics also consists of methods for dealing with a few other issues of a specific nature. Since these methods are essentially descriptive in nature, they have been discussed here as part of descriptive statistics. These are mainly concerned with the following:
(i) It often becomes necessary to examine how two paired data sets are related. For example, we may have data on the sales of a product and the expenditure incurred on its advertisement for a specified number of years. Given that sales and advertisement expenditure are related to each other, it is useful to examine the nature of the relationship between the two and quantify the degree of that relationship. As this requires the use of appropriate statistical methods, these fall under the purview of what we call regression and correlation analysis.
(ii) Situations occur quite often when we require averaging (or totaling) of data on prices and/or quantities expressed in different units of measurement. For example, the price of cloth may be quoted per meter of length and that of wheat per kilogram of weight. Since ordinary methods of totaling and averaging do not apply to such price/quantity data, special techniques needed for the purpose are developed under index numbers.
(iii) Many a time, it becomes necessary to examine the past performance of an activity with a view to determining its future behavior. For example, when engaged in the production of a commodity, monthly product sales are an important measure for evaluating performance. This requires compilation and analysis of relevant sales data over time. The more complex the activity, the more varied the data requirements. For profit maximizing and future sales planning, a forecast of the likely sales growth rate is crucial. This needs careful collection and analysis of past sales data. All such concerns are taken care of under time series analysis.
(iv) Obtaining the most likely future estimates on any aspect(s) relating to a business or economic activity is a concern of everyone involved. This is particularly important when it relates to product sales and demand, which serve as the necessary basis of production scheduling and planning. The regression, correlation, and time series analyses together help develop the basic methodology for this task. Thus, the study of methods and techniques of obtaining the likely estimates of business/economic variables comprises the scope of what we do under business forecasting.
There are three major functions in any business enterprise in which statistical methods are useful. These are as follows:
(i) The planning of operations: This may relate to either special projects or to the recurring activities of a firm over a specified period.
(ii) The setting up of standards: This may relate to the size of employment, volume of sales, fixation of quality norms for the manufactured product, norms for the daily output, and so forth.
(iii) The function of control: This involves comparison of actual production achieved against the norm or target set earlier. In case the production has fallen short of the target, it suggests remedial measures so that such a deficiency does not occur again.
Different authors have highlighted the importance of statistics in business. For instance, Croxton and Cowden give numerous uses of statistics in business such as project planning, budgetary planning and control, inventory planning and control, quality control, marketing, production and personnel administration. Within these also they have specified certain areas where statistics is very relevant. Another author, Irving W. Burr, dealing with the place of statistics in an industrial organisation, specifies a number of areas where statistics is extremely useful. These are: customer wants and market research, development design and specification, purchasing, production, inspection, packaging and shipping, sales and complaints, inventory and maintenance, costs, management control, industrial engineering and research.
Classification condenses the data by dropping out unnecessary details. It facilitates comparison between different sets of data clearly showing the different points of agreement and disagreement. It enables us to study the relationship between several characteristics and make further statistical treatment like tabulation, etc.
E.g., during a population census, people in the country are classified according to sex (male/female), marital status (married/unmarried), place of residence (rural/urban), age (0–5 years, 6–10 years, 11–15 years, etc.), profession (agriculture, production, commerce, transport, doctor, others), place of residence by region (West Bengal, Bihar, Mumbai, Delhi, etc.), and so on.
Data: Data refers to the observations of variables. There are some key terms that we need to know before we can properly examine data. The objective of statistics is to extract information from data.
Types of data: Data can be either categorical (qualitative) or numerical (quantitative).
Measurements of data:
Nominal level: Nominal data are items that are differentiated by a naming system. The names refer to different characteristics that something can take on. Examples are things such as eye color, countries or names of people. Data at the nominal level are qualitative. It does not make sense to calculate something like the mean or standard deviation of nominal data.
Ordinal level: Ordinal data are data that have an order (nominal data do not). For example, placement in a race (first, second, etc.) has an order, but no meaning can be given to the difference in the placements; i.e., the placements cannot be used to answer the question of “how much more?”
Interval level: In the case of interval-scaled data, the idea of difference does have meaning, but the scale has no true zero point. The most commonly cited example is temperature. 30°C is 20°C warmer than 10°C, but 0°C does not mean that there is no temperature.
Ratio level: For data at the ratio level the difference between two values makes sense and there is a true starting point. For example, between 10 km and 30 km there is a difference of 20 km, and the idea of 0 km has meaning in the sense that it is the absence of distance.
Nominal is the lowest level. Only names are meaningful here. Ordinal adds an order to the names. Interval adds meaningful differences. Ratio adds a zero so that ratios are meaningful.
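As a small illustration of why ratios need a true zero, the following Python sketch compares the temperatures from the interval-level example on the Celsius scale and on the Kelvin scale. The helper name celsius_to_kelvin is our own choice for illustration, not something from the notes.

```python
def celsius_to_kelvin(c):
    """Convert degrees Celsius to kelvins (0 K is a true zero)."""
    return c + 273.15

low_c, high_c = 10.0, 30.0

# Differences are meaningful on an interval scale.
difference = high_c - low_c                      # 20.0 degrees

# The naive Celsius ratio is an artefact of where the scale puts its zero:
# 30 degrees C is not "three times as hot" as 10 degrees C.
naive_ratio = high_c / low_c                     # 3.0

# On a scale with a true zero the ratio is meaningful, and much smaller.
kelvin_ratio = celsius_to_kelvin(high_c) / celsius_to_kelvin(low_c)

# Distance is ratio-level data: 30 km really is three times 10 km.
distance_ratio = 30.0 / 10.0

print(difference, naive_ratio, round(kelvin_ratio, 3), distance_ratio)
```

The Celsius ratio changes if the same temperatures are expressed in Fahrenheit, whereas the distance ratio is the same in kilometres or miles; that is the practical meaning of "ratio adds a zero".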
Data: Numbers or measurements that are collected as a result of observations. Array: An array is a systematic arrangement of objects, usually in rows and columns. Data Array: Observations that are systematically arranged. An arrangement of data in ascending or descending order is called an array.
There are four types of classification, viz., (i) Qualitative; (ii) Quantitative; (iii) Temporal; and (iv) Spatial.
(i) Qualitative classification: It is done according to attributes or non-measurable characteristics, like social status, sex, nationality, occupation, etc. For example, the population of the whole country can be classified into four categories as married, unmarried, widowed and divorced. When only one attribute, e.g., sex, is used for classification, it is called simple classification. When more than one attribute, e.g., deafness, sex and religion, is used for classification, it is called manifold classification.
(ii) Quantitative classification: It is done according to numerical size, like weights in kg or heights in cm. Here we classify the data by assigning arbitrary limits known as class-limits. The quantitative phenomenon under study is called a variable. For example, the population of the whole country may be classified according to different variables like age, income, wage, price, etc. Hence this classification is often called “classification by variables”.
(a) Variable: A variable in statistics means any measurable characteristic or quantity which can assume a range of numerical values within certain limits, e.g., income, height, age, weight, wage, price, etc. A variable can be classified as either discrete or continuous.
(1) Discrete variable: A variable which can take only exact values, and not any fractional values, is called a “discrete” variable. Number of workmen in a factory, members of a family, students in a class, number of births in a certain year, number of telephone calls in a month, etc., are examples of discrete variables.
via an online poll. For the survey method, the response rate is crucial, as a low response rate can destroy the validity of any conclusion resulting from the statistical analysis.
Some advantages of using primary data:
Secondary Data: Secondary data refers to information that has already been collected by some other person or organization. Examples include data collected from books and newspapers.
Published data: Published data refers to data from secondary sources, and the data is readily available. This method of collecting data is preferred due to its convenience, relatively low cost, and its reliability (assuming it has been collected by a reputable organization). However, when using secondary data, care needs to be taken, as errors may have been introduced as a result of a false transcription or a misinterpretation.
Some advantages of using secondary data:
If the value of a variable, e.g., height, weight, etc. (continuous), or number of students in a class, readings of a taxi-meter, etc. (discrete), occurs twice or more in a given series of observations, then the number of occurrences of the value is termed the “frequency” of that value. The way of tabulating a pool of data of a variable and their respective frequencies side by side is called a “frequency distribution” of those data. Croxton and Cowden defined frequency distribution as “a statistical table which shows the sets of all distinct values of the variable arranged in order of magnitude, either individually or in groups, with their corresponding frequencies side by side”.
Frequency: It is the number of observations falling in some class.
Frequency Distribution: It is a listing of classes and their frequencies.
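The definition above can be sketched in a few lines of Python. The observations are hypothetical values invented for illustration, and the function name frequency_distribution is our own.

```python
from collections import Counter

def frequency_distribution(observations):
    """Return (value, frequency) pairs arranged in order of magnitude."""
    counts = Counter(observations)
    return sorted(counts.items())

# Hypothetical discrete data (not taken from the notes).
observations = [3, 1, 2, 3, 3, 0, 1, 2, 3, 1]

table = frequency_distribution(observations)

# Print value, tally marks and frequency side by side.
print(f"{'Value':>5}  {'Tally':<6}  {'Frequency':>9}")
for value, freq in table:
    print(f"{value:>5}  {'/' * freq:<6}  {freq:>9}")
```

The frequencies always add up to the number of observations, which is a quick check on any hand-made tally sheet.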
Classes: There is no rule of thumb about the number of classes, but roughly it is suggested that the number of classes should be between 8 and 15.
Inclusive Classification: It is used for classes like 55-57, 58-60, ..., 68-70. Both end points of the classes are included in the respective classes.
Exclusive Classification: It is used for classes like 10-20, 20-30, ..., 50-60. The second (upper) end point is not included in the respective class.
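The difference between the two conventions shows up only at the end points. The Python sketch below uses helper names of our own choosing and the class limits quoted above; only the first two inclusive classes are listed, since the intermediate ones are elided in the notes.

```python
def in_inclusive_class(x, lower, upper):
    """Inclusive: both end points belong to the class (e.g. 55-57)."""
    return lower <= x <= upper

def in_exclusive_class(x, lower, upper):
    """Exclusive: the upper end point belongs to the next class (e.g. 10-20)."""
    return lower <= x < upper

def classify(x, classes, member):
    """Return the (lower, upper) class containing x, or None if no class does."""
    for lower, upper in classes:
        if member(x, lower, upper):
            return (lower, upper)
    return None

inclusive_classes = [(55, 57), (58, 60)]
exclusive_classes = [(10, 20), (20, 30), (30, 40), (40, 50), (50, 60)]

print(classify(57, inclusive_classes, in_inclusive_class))  # upper limit stays in 55-57
print(classify(20, exclusive_classes, in_exclusive_class))  # 20 goes to 20-30, not 10-20
print(classify(60, exclusive_classes, in_exclusive_class))  # 60 is outside the last class
```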
(a) Class-limits: The maximum and minimum values of a class-interval are called the upper class-limit and lower class-limit respectively.
(b) Class-mark, or Mid-value: The class-mark, or mid-value, of the class-interval lies exactly at the middle of the class-interval and is given by:
Class-mark = (lower class-limit + upper class-limit) / 2
(c) Class boundaries: Class boundaries are the true limits of a class-interval. They are associated with a grouped frequency distribution where there is a gap between the upper class-limit of one class and the lower class-limit of the next class. They can be determined by using the formulas:
Lower class boundary = lower class-limit − d/2
Upper class boundary = upper class-limit + d/2
where d = gap between the upper class-limit of a class and the lower class-limit of the next class.
Common width of a class-interval (when the class-intervals have equal widths)
= difference between two successive upper class-limits (or two successive lower class-limits)
= difference between two successive upper class-boundaries (or two successive lower class-boundaries)
= difference between two successive class-marks, or mid-values.
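As a worked illustration, the Python sketch below computes class-marks, class boundaries and the common width for the inclusive classes 55-57 and 58-60. It assumes the usual conventions (the class-mark is the mid-point of the two limits, and each boundary lies half the gap beyond the limit); the function names are ours.

```python
def class_mark(lower, upper):
    """Mid-value of a class-interval."""
    return (lower + upper) / 2

def class_boundaries(lower, upper, gap):
    """True limits: extend each class-limit by half the gap between classes."""
    half = gap / 2
    return (lower - half, upper + half)

limits = [(55, 57), (58, 60)]

# Gap between the upper limit of one class and the lower limit of the next.
gap = limits[1][0] - limits[0][1]                               # 58 - 57 = 1

marks = [class_mark(lo, hi) for lo, hi in limits]               # [56.0, 59.0]
bounds = [class_boundaries(lo, hi, gap) for lo, hi in limits]   # [(54.5, 57.5), (57.5, 60.5)]

# The common width can be read off in three equivalent ways.
width_from_limits = limits[1][0] - limits[0][0]                 # successive lower limits
width_from_bounds = bounds[1][0] - bounds[0][0]                 # successive lower boundaries
width_from_marks = marks[1] - marks[0]                          # successive class-marks

print(marks, bounds, width_from_limits, width_from_bounds, width_from_marks)
```

Note that the upper boundary of one class coincides with the lower boundary of the next, so the boundaries leave no gap between classes.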
Number of classes: k = 1 + 3.322 log10 N, where N = total number of observations in the data (formula suggested by H.A. Sturges).
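A minimal Python sketch of Sturges' rule, k = 1 + 3.322 log10 N, follows. Rounding the result up to the next whole number is an assumption made here for illustration; the notes do not say how the value should be rounded, and the function name is ours.

```python
import math

def sturges_classes(n):
    """Suggested number of classes for n observations (Sturges' rule)."""
    if n < 1:
        raise ValueError("n must be at least 1")
    return 1 + 3.322 * math.log10(n)

n = 50
k_exact = sturges_classes(n)      # about 6.64
k = math.ceil(k_exact)            # assumption: round up to a whole number of classes

print(f"N = {n}: k = {k_exact:.3f}, use {k} classes")
```

The rule grows slowly with N: multiplying the number of observations by ten adds only about 3.3 classes.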
Frequency Distribution Methods
Entry Table: By listing the actual observations.
Tally Sheet: By using a tally column.
Frequency distributions are also divided into several kinds according to the nature of the raw data. Much useful information can be inferred from a frequency distribution table; therefore, the table should be presented in a proper and useful manner. Following are the various types of frequency distribution:
1. Frequency Distribution for Discrete Data
The class limits in discrete data are the true class limits, and there will be no class boundaries because discrete data are not in fractions. For example, the following figures represent the number of children born to 50 women in a certain locality up to the age of 40 years.
the class interval size of 1.
Number of Children    Tally Marks    Number of Women
0                     //
1                     //
2                     ////
3                     ////
4                     //// ///
5                     //// ////
6                     //// /
7                     ////
8                     ///
9                     ////

4. Relative Cumulative Frequency Distribution
The cumulative frequency of a class divided by the total frequency is called the relative cumulative frequency. It is also called the percentage cumulative frequency, since it is expressed as a percentage. The table showing relative cumulative frequencies is called the relative cumulative frequency distribution or percentage cumulative frequency distribution.
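The calculation of relative cumulative frequencies can be sketched in Python as follows. The class frequencies are hypothetical numbers chosen for illustration (they are not the frequencies of the table above), and the function name is ours.

```python
from itertools import accumulate

def relative_cumulative(frequencies):
    """Cumulative frequency of each class as a percentage of the total."""
    total = sum(frequencies)
    cumulative = list(accumulate(frequencies))
    return [100 * c / total for c in cumulative]

# Hypothetical class frequencies (illustrative only).
frequencies = [2, 3, 5, 10]

cumulative = list(accumulate(frequencies))       # [2, 5, 10, 20]
percent = relative_cumulative(frequencies)       # [10.0, 25.0, 50.0, 100.0]

for f, c, p in zip(frequencies, cumulative, percent):
    print(f"{f:>4} {c:>4} {p:>6.1f}%")
```

The last relative cumulative frequency is always 100%, since the final cumulative frequency equals the total frequency.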
Presentation of Statistical Data
Statistical data can be presented in three different ways: (a) Textual presentation, (b) Tabular presentation, and (c) Graphical presentation.
(a) Textual presentation: This is a descriptive form. The following is an example of such a presentation of data about deaths from industrial diseases in Great Britain in 1935–39 and 1940–44.
Example: Numerical data with regard to industrial diseases and deaths therefrom in Great Britain during the years 1935–39 and 1940–44 are given in a descriptive form: “During the quinquennium 1935–39, there were in Great Britain 1,775 cases of industrial diseases made up of 677 cases of lead poisoning, 111 of other poisoning, 144 of anthrax, and 843 of gassing. The number of deaths reported was 20 p.c. of the cases for all the four diseases taken together, that for lead poisoning was 135, for other poisoning 25 and that for anthrax was 30.
During the next quinquennium, 1940–44, the total number of cases reported was 2,
cases increased by 784 between the two periods. The number of deaths reported decreased by 45 for lead poisoning, but decreased only by 2 for anthrax from the pre-war to the post-war quinquennium. In the later period, 52 deaths were reported for poisoning other than lead poisoning. The total number of deaths reported in 1940–44 including those from gassing was 64 greater than in 1935–39”.
The disadvantages of textual presentation are: (i) It is too lengthy; (ii) There is repetition of words; (iii) Comparisons cannot be made easily; (iv) It is difficult to get an idea and take appropriate action.
(b) Tabular presentation, or Tabulation: “The process of arranging data into rows and columns is called tabulation.” A table is a systematic arrangement of data into vertical columns and horizontal rows. For example, data on the population of a country can be tabulated on the basis of religion, gender or marital status. Tabulation may be simple, double, triple or complex depending on the nature of the classification used by the statistician. It presents numerical data according to certain characteristics in a concise and attractive form which can be easily understood and used to compare numerical figures.
The descriptive form of the Example above has been condensed below in the form of a table.
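As a sketch of how the figures quoted for 1935–39 condense into a table, the Python snippet below arranges them in rows and columns. One step is an assumption on our part: the number of gassing deaths is not stated directly in the passage, so it is derived here as the remainder after taking 20 p.c. of all cases as the total number of deaths. Only the first quinquennium is tabulated, because the figures for 1940–44 are incomplete in the text.

```python
# Cases of industrial disease in Great Britain, 1935-39, as quoted in the passage.
cases = {
    "Lead poisoning": 677,
    "Other poisoning": 111,
    "Anthrax": 144,
    "Gassing": 843,
}
# Deaths stated explicitly in the passage.
deaths = {"Lead poisoning": 135, "Other poisoning": 25, "Anthrax": 30}

total_cases = sum(cases.values())           # 1775, matching the quoted total
total_deaths = total_cases * 20 // 100      # "20 p.c. of the cases" = 355

# Assumption: gassing deaths are whatever remains of the total.
deaths["Gassing"] = total_deaths - sum(deaths.values())    # 165

print(f"{'Disease':<16}{'Cases':>7}{'Deaths':>8}")
for disease in cases:
    print(f"{disease:<16}{cases[disease]:>7}{deaths[disease]:>8}")
print(f"{'Total':<16}{total_cases:>7}{total_deaths:>8}")
```

Compared with the running text, the tabular form makes the column totals and the comparison between diseases visible at a glance.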
Types of Tabulation
In general, tabulation is classified into two kinds: simple tabulation and complex tabulation. Simple tabulation gives information regarding one or more independent questions. Complex tabulation gives information regarding two or more mutually dependent questions.

ONE-WAY TABLE
DIVISION       POPULATION (Millions)
Karachi
Hyderabad
Sukkur
This table gives us information regarding one characteristic: the population in the different divisions of Sindh.
All questions that can be answered from a one-way table are independent of each other. It is therefore an example of simple tabulation, since the information obtained from it concerns one independent question, namely the number of persons (in millions) in the various divisions of Sindh.
Two-Way Table
This type of table gives information regarding two mutually dependent questions. For example, if the question is how many million persons live in each division, the one-way table gives the answer. But if we also want to know whether males or females are in the majority in that population, a two-way table answers the question by providing separate columns for male and female. Thus the table showing the division-wise population by sex is as under:

TWO-WAY TABLE
               POPULATION (Millions)
DIVISION       Male      Female    Total
Karachi
Hyderabad
Sukkur
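The step from a one-way to a two-way table can be sketched in Python by counting records first on one characteristic and then on two. The records below are a handful of hypothetical individuals invented for illustration; they are not population figures for Sindh.

```python
from collections import Counter

# Hypothetical records of (division, sex), for illustration only.
records = [
    ("Karachi", "Male"), ("Karachi", "Female"), ("Karachi", "Male"),
    ("Hyderabad", "Female"), ("Hyderabad", "Male"),
    ("Sukkur", "Female"), ("Sukkur", "Female"),
]

# One-way table: a single question (how many persons per division?).
one_way = Counter(division for division, _ in records)

# Two-way table: two dependent questions (division and sex together).
two_way = Counter(records)

divisions = ["Karachi", "Hyderabad", "Sukkur"]
print(f"{'DIVISION':<12}{'Male':>6}{'Female':>8}{'Total':>7}")
for d in divisions:
    male = two_way[(d, "Male")]        # a Counter returns 0 for a missing cell
    female = two_way[(d, "Female")]
    print(f"{d:<12}{male:>6}{female:>8}{male + female:>7}")
```

The Total column of the two-way table reproduces the one-way table, which is why the two-way table is said to answer both questions at once.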
Three-Way Table
A three-way table gives information regarding three mutually dependent and inter-related questions. For example, from the one-way table we get information about the population, and from the two-way table we get information about the number of males and females in the various divisions. Now we can extend the same table to a three-way table by asking a further question: “How many males and females are literate?” Thus the collected statistical data will show the following three mutually dependent and inter-related questions:
THREE-WAY TABLE
               POPULATION (Millions)
DIVISION       Male                           Female                         Total
               Literate  Illiterate  Total    Literate  Illiterate  Total    Literate  Illiterate  Total
Karachi
Hyderabad
Sukkur
This table gives information concerning the literacy of both males and females in the various divisions of Sindh. From the table we can tell which sex has more education in each division, and we can also say whether literacy is lower in rural areas than in urban areas.
Higher Order Tables
Higher order tables are those which provide information about a large number of inter-related questions. Higher order tables may be four-way, five-way, six-way, etc. Such tables are called manifold tables.
The main objectives of tabulation are stated below: (i) To carry out investigation; (ii) To do comparison; (iii) To locate omissions and errors in the data; (iv) To use space economically; (v) To study the trend; (vi) To simplify data; (vii) To use it as future reference.
Sorting: Sorting of data is the last process of tabulation. It is a time-consuming process when the data set is large.
After classification the data may be sorted using either of the following methods:
(i) Manual method: Here the sorting is done by hand by giving tally marks for the number of times each event has occurred. Next the total tally marks are counted. The method is simple and suitable for limited data.
(ii) Mechanical and electrical method: To reduce the sorting time, mechanical devices may be used. This is described as mechanical tabulation. For electrical tabulation, data should be codified first and then punched on cards. For each data item a separate card is used. The punched cards are checked by a machine called a “verifier”. Next the cards are sorted into different groups as desired by a machine called a “sorter”. Finally, the tabulation is done by using a tabulator. The same card may be sorted more than once for completing tables under different titles.