











Probability is one of the most important mathematical concepts for humanities and cultural analytics. The goal of these notes is to provide a reference on probability theory that complements the lecture material. Our aim is to present probability theory in a way that is both rigorous (i.e., it leaves as little as possible imprecise and cuts as few corners as it can), and accessible (i.e., it does not assume prior knowledge of any mathematics beyond arithmetic).
Probability theory is a mathematical language that allows us to speak with precision about how likely it is that a given process results in a given outcome. The term ‘process’ can be understood very broadly. Flipping a coin is a process, as is flipping 1,000 coins, as is measuring the temperature of a glass of water, as is writing a novel. For our purposes here, we’ll say that a process is anything that begins with some initial conditions and ends with an outcome, where we as the inquirers can divide things into initial conditions and outcomes however we want. So in the examples above, the initial conditions might be broadly described as:
The corresponding outcomes of interest might be:
These examples are not exhaustive; probability theory has a myriad of applications. But they should give you an idea of the sorts of uses that probability theory can and does have, including in the context of quantitative humanities research. In what follows, we’ll provide the necessary background to understand and develop some useful applications of probability theory.
The language of probability theory is actually a special application of the more general language that a lot of mathematical theories are written in: the language of set theory. One can spend their whole life studying set theory. Thankfully, we won’t. Instead, we’ll introduce only the minimal set-theoretic language needed to do some useful probability theory. Specifically, we’ll define:
Intuition can be a good guide to understanding some of these concepts. In the case of subset, union, and intersection, the terms mean more or less what you might think they mean from an understanding of ordinary English. Nevertheless, for the sake of rigor, we’ll define each more precisely.
We’ll begin by defining a set very generally.
Definition 1. A set is a collection of elements.
On its own, this doesn’t tell us much, because it raises an obvious question: what are elements? The simple answer is that elements are whatever we want them to be. They might be numbers, they might be letters or words, they might be shapes or symbols, and so on. For the sake of doing probability theory, we’ll often want to think of the elements of a set as symbols that denote possible outcomes of processes. How exactly symbols denote possible events in the world is a question that has kept philosophers of language from Bertrand Russell to Jacques Derrida very busy, but let’s ignore that for now and just accept that sets are collections of elements, where those elements can be whatever we want them to be. This flexibility is a virtue of set theory; because the elements of sets can be anything, we can use set theory to talk about a wide variety of subjects.
Let us now introduce some standard set-theoretic notation, i.e., some symbols that allow us to write efficiently in the language of set theory. This notation is purely conventional; there’s nothing about set theory that requires us to use this particular notation. But these symbols are so common that it’s worth being familiar with them. We’ll sometimes use italicized capital letters, like S, to refer to sets. We’ll also typically list the elements of a set inside braces { } to indicate that those elements are all contained within a single set. So for instance, if the set S contains all and only the numbers 1, 2, 3, and 4, we can write this as S = {1, 2, 3, 4}. We might also use Greek letters, like Ω or Σ, to refer to sets. So a set Ω containing all and only the letters h and t, which might stand for a heads or tails outcome of a coin toss, can be written as Ω = {h, t}. Finally, we might want to remain very agnostic about the nature of the elements of a set. In that case, we’ll use a numbered, lower-case letter to refer to the elements of a set. So we might have the set S = {s₁, s₂, ..., sₙ}.
Each element sᵢ, where i is any number from 1 to n, denotes some element of the set, where that element can be anything at all. If we switch to Greek letters, we might do the same thing using the notation Ω = {ω₁, ω₂, ..., ωₙ}. Note that the choice to use Roman or Greek letters is also just a matter of convention; in different contexts it is typical to use one or the other, but there’s no real rhyme or reason behind it other than the cultural quirks of a particular application of set theory.
We denote claims about set membership using the symbols ∈ and ∉. We read sᵢ ∈ S as ‘sᵢ is an element of S’, and sᵢ ∉ S as ‘sᵢ is not an element of S’.
The ordering of elements in a set does not matter. To illustrate, the set {a, b, c} is identical to the set {c, b, a}. This would be just as true if these sets contained numbers, shapes, or anything else instead of letters. Similarly, an element is only “counted” as belonging to a set once. There’s no distinction, in basic set theory, between the set {3, 4, 4} and the set {3, 4}. This would be just as true if we replaced 3 and 4 with letters, shapes, words, etc.
Importantly, sets can be elements of sets. For example, the set S = {{1, 2}, {3, 4}} has as its elements the sets {1, 2} and {3, 4}. Crucially, this does not mean that 1, 2, 3, or 4 are elements of S. Indeed, they’re not. Rather, only the sets {1, 2} and {3, 4} are elements of S. The numbers 1 and 2 are elements of {1, 2}, but not S, and 3 and 4 are elements of {3, 4}, but not S. More generally, we say that set membership is not transitive. That is, if sᵢ ∈ S and S ∈ Σ, it does not follow that sᵢ ∈ Σ. This is an admittedly counter-intuitive part of set theory, so it may be worth reading this paragraph a few times, and then trying the following exercise:
Exercise 1. Let C be a set containing all the clubs in a standard deck of cards, and let D be a set containing all the diamonds in a standard deck of cards. Let S be a set defined so that S = {C, D}. Now answer the following questions (correct answers in footnote):
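The points above about ordering, repeated elements, and non-transitive membership can also be checked mechanically. The following is a minimal Python sketch, not part of the original notes. It uses the example set S = {{1, 2}, {3, 4}} from the text; the variable names are our own, and the use of frozenset for the inner sets is a quirk of Python (its ordinary sets cannot contain other ordinary sets), not a feature of set theory.

```python
# Ordering does not matter: these two sets are identical.
a = {"a", "b", "c"}
b = {"c", "b", "a"}
order_irrelevant = (a == b)

# An element is only counted once.
duplicates_irrelevant = ({3, 4, 4} == {3, 4})

# A set whose elements are themselves sets (frozenset is a Python detail).
inner_12 = frozenset({1, 2})
inner_34 = frozenset({3, 4})
S = {inner_12, inner_34}

print(order_irrelevant)       # True
print(duplicates_irrelevant)  # True
print(inner_12 in S)          # True:  {1, 2} is an element of S
print(1 in inner_12)          # True:  1 is an element of {1, 2}
print(1 in S)                 # False: membership is not transitive
```

The last three lines mirror the argument in the text: 1 ∈ {1, 2} and {1, 2} ∈ S, yet 1 ∉ S.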
Remark 3. Any set is an element of its own power set. This follows from the fact that the power set of any set A is the set of all subsets of A, and the fact, demonstrated in Remark 1 above, that any set is a subset of itself.
Notice further that while 1, 2, 3, and 4 are all elements of S, none of them are elements of ℘(S). Sets containing these integers are elements of ℘(S), but this is not the same thing as those integers being elements of ℘(S), because set membership is not transitive. This would be just as true if S did not contain integers but instead contained letters, shapes, words, or anything else. If this point does not make sense (and indeed, this part of set theory has the potential to be confusing), then we suggest returning to Exercise 1. Once you have mastered that, the idea that an element of S need not be an element of the power set of S should be clear. Notice as well that the empty set ∅ is an element of ℘(S). Here too, we can make a more general remark:
Remark 4. The empty set is an element of any power set. This follows from the fact that the power set of any set A is the set of all subsets of A, and the fact, demonstrated in Remark 2, that the empty set is a subset of any set.
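Remarks 3 and 4 can be illustrated by building a power set explicitly. The sketch below is our own and assumes S = {1, 2, 3, 4}, the set discussed in the surrounding text. The helper name power_set is hypothetical; it is built from the standard library's itertools.combinations.

```python
from itertools import chain, combinations

def power_set(s):
    """Return the set of all subsets of s, each represented as a frozenset."""
    items = list(s)
    subsets = chain.from_iterable(
        combinations(items, k) for k in range(len(items) + 1)
    )
    return {frozenset(subset) for subset in subsets}

S = {1, 2, 3, 4}
P_S = power_set(S)

print(len(P_S))               # 16, i.e. 2 to the power 4
print(frozenset(S) in P_S)    # True:  S is an element of its own power set (Remark 3)
print(frozenset() in P_S)     # True:  the empty set is an element (Remark 4)
print(1 in P_S)               # False: the integer 1 is not an element of the power set
print(frozenset({1}) in P_S)  # True:  but the set {1} is
```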
For the sake of doing probability theory, we’ll often be interested in defining a set that contains either all the elements of both A and B, or all the elements shared by A and B, where A and B are themselves sets. This can be made more precise by defining the concepts of the union and the intersection of two sets. We begin by defining the union of any two sets:
Definition 4. The union of any two sets A and B is the set containing all and only those elements that are in either A or B.
Using symbols, we write A ∪ B for the union of A and B. To illustrate, the union of the sets of words A = {democracy, capitalism, oligarchy} and B = {democracy, religion, materialism} is:
A ∪ B = {democracy, capitalism, oligarchy, religion, materialism}.
Note that for any sets A and B, A ⊆ A ∪ B and B ⊆ A ∪ B. This is because any element of a set A is also an element of the union of A with any other set. Next, we define the intersection of any two sets:
Definition 5. The intersection of any two sets A and B is the set containing all and only those elements that are in both A and B.
Using symbols, we write A ∩ B for the intersection of A and B. To illustrate, the intersection of the sets of words A = {democracy, capitalism, oligarchy} and B = {democracy, religion, materialism} is A ∩ B = {democracy}. Note that for any sets A and B, A ∩ B ⊆ A and A ∩ B ⊆ B. This is because any element of the intersection A ∩ B is also an element of both A and B. If A and B share no elements, then the only set containing all and only those elements in both A and B is the set containing no elements, i.e., the empty set. In this case, we write A ∩ B = ∅.
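The union and intersection examples above can be reproduced with Python's built-in set operators. This is an illustrative sketch only: the sets A and B are the word sets from the text, while the variable names union and intersection are our own.

```python
A = {"democracy", "capitalism", "oligarchy"}
B = {"democracy", "religion", "materialism"}

union = A | B         # all elements in A or B (or both)
intersection = A & B  # all elements in both A and B

print(sorted(union))  # ['capitalism', 'democracy', 'materialism', 'oligarchy', 'religion']
print(intersection)   # {'democracy'}

# The subset facts noted in the text (<= is Python's subset test).
print(A <= union and B <= union)                # True
print(intersection <= A and intersection <= B)  # True

# Two sets that share no elements have an empty intersection.
print({"capitalism"} & {"religion"} == set())   # True
```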
A function between two sets is a mathematical object that takes the elements from one set and matches them to elements of a second set. This can be defined more formally as follows:
Definition 6. A function f from set A to set B is a relation that associates each aᵢ ∈ A with an element f(aᵢ) of B.
More rigorous definitions of a function are possible, but this will suffice for our purposes. Symbolically, we use lower-case letters to represent functions, and write f : A → B to denote that f is a function from A to B. Importantly, if f is a function from A to B, then f(aᵢ) must be an element of B for every aᵢ ∈ A. However, it is not the case that every element of B must have an associated element of A that is mapped to it. To illustrate, if we let A = {x, y, z} and B = {α, β, γ}, then we can define a function f : A → B such that f(x) = α, f(y) = α, and f(z) = γ. Alternatively, we might define a function g : A → B such that g(x) = α, g(y) = β, and g(z) = γ. The first example shows that functions are not generally symmetric; a function from A to B does not necessarily define a function from B to A. Reversing f would send α to both x and y and would send β to nothing at all, so the reversal of f is not a function from B to A.
You may recall from other coursework in mathematics that we sometimes summarize functions using equations. For instance, we can define a function f from the integers to the integers such that each integer x is mapped to its square. This is summarized by the equation f(x) = x^2. However, in what follows, we’ll define probability functions, which usually cannot be summarized in this way.
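One way to make the two example functions concrete is to store each as a Python dictionary from elements of A to elements of B. This is a sketch under our own conventions: the Greek letters are spelled out as strings, and the helper names is_function and reversed_is_function are hypothetical.

```python
A = {"x", "y", "z"}
B = {"alpha", "beta", "gamma"}

# The two functions from the text, written as lookup tables.
f = {"x": "alpha", "y": "alpha", "z": "gamma"}
g = {"x": "alpha", "y": "beta", "z": "gamma"}

def is_function(mapping, domain, codomain):
    """Every element of the domain is sent to some element of the codomain."""
    return set(mapping) == domain and set(mapping.values()) <= codomain

def reversed_is_function(mapping, codomain):
    """True if reversing every pair yields a function defined on all of the codomain."""
    values = list(mapping.values())
    covers_codomain = set(values) == codomain
    no_value_repeated = len(values) == len(set(values))
    return covers_codomain and no_value_repeated

print(is_function(f, A, B))        # True
print(is_function(g, A, B))        # True
print(reversed_is_function(f, B))  # False: alpha has two sources, beta has none
print(reversed_is_function(g, B))  # True
```

A dictionary enforces the key requirement of Definition 6 automatically, since each key can be associated with only one value.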
3 Probability Theory
Now for the fun part. Recall from the introduction that probability theory is a mathematical language that allows us to speak precisely about the likelihood of any given outcome of some process. In what follows, we’ll introduce the crucial notion of a probability space, and then move on to the concept of conditional probability.
By the end of this subsection, we’ll define the all-important concept of a probability space. To get there, we’ll have to define some other notions first. Let Ω be a set containing all the possible outcomes of some process. We’ll assume for now that Ω has finitely many elements (i.e., that there is some integer n that is equal to the number of elements in Ω). While there are many instances in which one might want to consider processes that have an infinite set of possible outcomes, doing so makes probability theory much more complicated, so we’ll leave such cases aside for now. To illustrate, if the process we are modeling is a single roll of a die, then Ω might be the set {1, 2, 3, 4, 5, 6}, where each number denotes the side of the die that shows after the roll. Next, consider the power set ℘(Ω). This is the set of all subsets of the set of possible outcomes of the process being modelled. In the case of the set Ω = {1, 2, 3, 4, 5, 6} representing possible outcomes of the die roll, the power set ℘(Ω) is:
℘(Ω) = {∅, {1}, {2}, {3}, {4}, {5}, {6}, {1, 2}, {1, 3}, {1, 4}, {1, 5}, {1, 6}, {2, 3}, {2, 4}, {2, 5}, {2, 6}, {3, 4}, {3, 5}, {3, 6}, {4, 5}, {4, 6}, {5, 6}, {1, 2, 3}, {1, 2, 4}, {1, 2, 5}, {1, 2, 6}, {1, 3, 4}, {1, 3, 5}, {1, 3, 6}, {1, 4, 5}, {1, 4, 6}, {1, 5, 6}, {2, 3, 4}, {2, 3, 5}, {2, 3, 6}, {2, 4, 5}, {2, 4, 6}, {2, 5, 6}, {3, 4, 5}, {3, 4, 6}, {3, 5, 6}, {4, 5, 6}, {1, 2, 3, 4}, {1, 2, 3, 5}, {1, 2, 3, 6}, {1, 2, 4, 5}, {1, 2, 4, 6}, {1, 2, 5, 6}, {1, 3, 4, 5}, {1, 3, 4, 6}, {1, 3, 5, 6}, {1, 4, 5, 6}, {2, 3, 4, 5}, {2, 3, 4, 6}, {2, 3, 5, 6}, {2, 4, 5, 6}, {3, 4, 5, 6}, {1, 2, 3, 4, 5}, {1, 2, 3, 4, 6}, {1, 2, 3, 5, 6}, {1, 2, 4, 5, 6}, {1, 3, 4, 5, 6}, {2, 3, 4, 5, 6}, {1, 2, 3, 4, 5, 6}}.
Don’t worry about reading each element of this power set; we write it out in full here just to give you a sense of what its elements are. Taking stock, we now have a set Ω whose elements are the possible outcomes of a process, and a power set ℘(Ω) whose elements are sets of possible outcomes of that process. This puts us in a position to define a probability function:
Definition 7. A probability function P : ℘(Ω) → [0, 1] is a function from the set of sets of possible outcomes ℘(Ω) into the set of all real numbers between 0 and 1 inclusive, written [0, 1], where P has the following properties:
This is a somewhat more involved definition than we’ve had so far, so we’ll break it down piece by piece, with examples. Hopefully, it will be satisfying to do so, because it will bring together most of the concepts we’ve defined so far. First, there is the definition of a probability function P as a function from ℘(Ω) into [0, 1]. To get a sense for what this means, consider first any S that is a subset of the set of possible outcomes
said probabilities. For instance, nothing in probability theory itself says that we should assign all outcomes of a die roll equal probability; this is an assumption we’ll have to justify by other means, if we can justify it at all. It may be that, for some reason, a die is weighted so that even outcomes are more likely than odd outcomes. If this were true, it might change the probability function that we’d want to use in a model of the die roll process. All that probability theory itself tells us is that, whatever probabilities we do assign to sets of possible outcomes, the probability function must satisfy the three properties listed in Definition 7.
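The contrast between a fair die and a weighted die can be sketched in code. Two caveats apply. First, the three properties of Definition 7 are not reproduced in this copy of the notes, so the check below assumes the standard ones for a finite outcome set: every probability lies in [0, 1], P(Ω) = 1, and P(A ∪ B) = P(A) + P(B) whenever A and B share no elements. Second, the weights of the loaded die are invented for illustration, and all function names are our own.

```python
from fractions import Fraction
from itertools import chain, combinations

omega = frozenset({1, 2, 3, 4, 5, 6})

def power_set(s):
    """Return the set of all subsets of s, each represented as a frozenset."""
    items = list(s)
    subsets = chain.from_iterable(
        combinations(items, k) for k in range(len(items) + 1)
    )
    return {frozenset(subset) for subset in subsets}

def make_probability(weights):
    """Build P on sets of outcomes by summing per-outcome weights."""
    def P(event):
        return sum((weights[o] for o in event), Fraction(0))
    return P

def satisfies_assumed_properties(P, omega):
    """Check the three ASSUMED properties on every element of the power set."""
    events = power_set(omega)
    in_range = all(0 <= P(e) <= 1 for e in events)
    total_is_one = P(omega) == 1
    additive = all(
        P(a | b) == P(a) + P(b)
        for a in events for b in events if not (a & b)
    )
    return in_range and total_is_one and additive

# Fair die: every outcome equally likely (an assumption, as the text stresses).
fair = make_probability({o: Fraction(1, 6) for o in omega})

# Hypothetical loaded die: even faces twice as likely as odd faces.
loaded = make_probability({
    1: Fraction(1, 9), 2: Fraction(2, 9), 3: Fraction(1, 9),
    4: Fraction(2, 9), 5: Fraction(1, 9), 6: Fraction(2, 9),
})

evens = frozenset({2, 4, 6})
print(fair(evens))    # 1/2
print(loaded(evens))  # 2/3
print(satisfies_assumed_properties(fair, omega))    # True
print(satisfies_assumed_properties(loaded, omega))  # True
```

Both functions pass the same checks, which is the point made in the text: probability theory constrains the form of the assignment, not which assignment is appropriate for a given die.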
3.1.1 A Caveat
At this stage, we have to make a confession. We have not told you the whole truth here. So now we will. In many contexts, one can use a probability space in which probabilities are not defined on the full power set of outcomes. In fact, when the set of possible outcomes is infinite, we may be required to define our probability space differently. So, if you choose to go on and study more advanced aspects of probability theory, then you’ll have to prepare to deviate somewhat from what we’ve presented above. At the same time, a lot of useful applications of probability theory can be done using just what we’ve presented so far, including everything that you will encounter in this course. Specifically, as long as the process that you’re studying has finitely many possible outcomes (even if there are ten trillion of them), then you’ll be compliant with all mathematical rules if you just stick to the techniques we’ve presented here.
Often, we’ll want to change the probability that we assign to a given set of outcomes of a process once we learn some information about the actual outcome of that process. Returning to our example of a die roll, we might initially believe that the likelihood of the die roll resulting in an outcome where the die shows a 1 is 1/6. However, if we learn that the outcome of the die roll was such that the die showed an odd number, then we may wish to revise this belief, and instead say that the likelihood of the die showing a 1 is 1/3 (since we now know that the actual outcome must be either 1, 3, or 5). It turns out that the language of probability theory offers a very precise way of talking about this practice of changing or updating our beliefs.
Suppose that we have a probability space consisting of a set of possible outcomes Ω, its power set ℘(Ω), and a probability function P : ℘(Ω) → [0, 1]. Let A and B be any two elements of the power set ℘(Ω), i.e., any two subsets of Ω. The conditional probability P(A|B) can be read as ‘the probability that the outcome of the process is an element of A, given that it is an element of B’. For instance, in the example given above, the claim ‘the probability that the outcome of the die roll is in {1}, given that it is in the set of odd outcomes {1, 3, 5}, is 1/3’ can be written as P({1}|{1, 3, 5}) = 1/3. In fact, if we already have a well-defined probability function over all the elements of a power set ℘(Ω), then we can calculate the value of any conditional probability. Letting A and B be any elements of ℘(Ω), the value of the conditional probability P(A|B) can be calculated using the formula
$$P(A \mid B) = \frac{P(A \cap B)}{P(B)}$$
That is, the value of the conditional probability P(A|B) is the ratio between the probability assigned to the intersection A ∩ B and the probability assigned to the set of possible outcomes B. Thus, the equation above is often referred to as the “ratio formula” for conditional probability. To illustrate using the example above, if we assume that all outcomes of a die roll are equally likely, then the probability of an outcome in the set {1} ∩ {1, 3, 5} = {1} is 1/6, whereas the probability that the outcome of the die roll is in the set {1, 3, 5} is 3/6, or 0.5. Using the ratio formula, we can calculate the value of the conditional probability P({1}|{1, 3, 5}) as follows:
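$$P(\{1\} \mid \{1, 3, 5\}) = \frac{P(\{1\} \cap \{1, 3, 5\})}{P(\{1, 3, 5\})} = \frac{1/6}{1/2} = \frac{1}{3}$$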
Thus, the probability of the outcome of the die roll being 1, given that we know the outcome will be an odd number, is 1/3. Note that, for any set of possible outcomes B, when P(B) = 0, the conditional probability P(A|B) is undefined, since division is not defined when the denominator is zero. In more advanced applications of probability theory, one can explore alternative ways of defining conditional probability which avoid this issue, but for our purposes here, we don’t need to worry about this, since we won’t assign probability zero to any events which might occur.
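The ratio formula translates directly into a few lines of Python. The sketch below assumes, as in the worked example, that all six outcomes of the die roll are equally likely. The function names P and conditional are our own, and exact fractions are used so that the result is 1/3 and not a rounded decimal.

```python
from fractions import Fraction

def P(event):
    """Uniform probability for a six-sided die; event must be a subset of {1, ..., 6}."""
    return Fraction(len(event), 6)

def conditional(P, A, B):
    """Ratio formula: P(A|B) = P(A ∩ B) / P(B), undefined when P(B) = 0."""
    if P(B) == 0:
        raise ValueError("P(A|B) is undefined when P(B) = 0")
    return P(A & B) / P(B)

result = conditional(P, {1}, {1, 3, 5})
print(result)  # 1/3

# Conditioning on a set of probability zero is rejected instead of returning a number.
try:
    conditional(P, {1}, set())
except ValueError as error:
    print(error)
```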
4 Application in Humanities Analytics
In this final section, we’ll show how one can use formal probability theory to set up a context of inquiry for a toy example of a project in humanities analytics. The example we use will be very simplistic, so as to make it very clear how these concepts apply. Indeed, it may not be the case that an actual project in humanities analytics or digital humanities would ever be quite this simple. Suppose that we’re interested in the use of floral imagery and animal imagery in nineteenth-century British novels. To study this probabilistically, we’ll think of a generic nineteenth-century British novelist as a process that can result in one of four outcomes:
1. Novel with floral and animal imagery
2. Novel with floral but not animal imagery
3. Novel with animal but not floral imagery
4. Novel with neither animal nor floral imagery
Let Ω be the set containing these four outcomes, and let ℘(Ω) be the power set of Ω, containing each of its subsets. Our goal is to define a probability function P on ℘(Ω). This, it must be noted, is the hard part. The simplest way to do it would be to go through every nineteenth-century British novel and record which varieties of imagery, if any, it contains. This would enable us to calculate probabilities for the sets containing one and only one element of Ω using the following equations:
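$$P(\{\text{Novel with floral and animal imagery}\}) = \frac{\text{\# of novels with floral and animal imagery}}{\text{Total \# of novels}}$$
$$P(\{\text{Novel with floral but not animal imagery}\}) = \frac{\text{\# of novels with floral but not animal imagery}}{\text{Total \# of novels}}$$
$$P(\{\text{Novel with animal but not floral imagery}\}) = \frac{\text{\# of novels with animal but not floral imagery}}{\text{Total \# of novels}}$$
$$P(\{\text{Novel with neither animal nor floral imagery}\}) = \frac{\text{\# of novels with neither animal nor floral imagery}}{\text{Total \# of novels}}$$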
As a practical matter, it will be impossible to calculate these four ratios with perfect historical accuracy. But through a careful mix of archival knowledge, text-processing technology, and (most importantly) expertise in literary study and history, it may be possible to arrive at decent estimates. Once we have probabilities assigned to these four sets, each of which contains one and only one element of Ω, we note that all other non-empty elements of the power set ℘(Ω) can be formed by taking the union of some combination of the four sets assigned probabilities above. So, via the third property of a probability function, we can use the four probabilities calculated above, along with simple addition, to calculate probabilities for every other element of the power set ℘(Ω).
This allows us, in turn, to calculate conditional probabilities. For instance, suppose that we wanted to know the probability that a novel contains animal imagery, given that it contains floral imagery. Let A be the set of outcomes in which the novel produced has animal imagery, which is defined as follows:
A = {Novel with floral and animal imagery, Novel with animal but not floral imagery}.
Let F be the set of outcomes in which the novel produced has floral imagery, which is defined as follows:
F = {Novel with floral and animal imagery, Novel with floral but not animal imagery}.
Note that the intersection A ∩ F is defined as follows:
A ∩ F = {Novel with floral and animal imagery}.
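With A, F, and A ∩ F in hand, the conditional probability P(A|F) follows from the ratio formula. The sketch below shows the calculation in Python. The novel counts are entirely invented for illustration and are not estimates of anything; the variable names are our own. Only the four outcome labels and the sets A and F come from the text.

```python
from fractions import Fraction

BOTH = "Novel with floral and animal imagery"
FLORAL_ONLY = "Novel with floral but not animal imagery"
ANIMAL_ONLY = "Novel with animal but not floral imagery"
NEITHER = "Novel with neither animal nor floral imagery"

# HYPOTHETICAL counts, made up purely to show the arithmetic.
counts = {BOTH: 300, FLORAL_ONLY: 200, ANIMAL_ONLY: 100, NEITHER: 400}
total = sum(counts.values())

def P(event):
    """Probability of a set of outcomes: matching novels divided by all novels."""
    return Fraction(sum(counts[outcome] for outcome in event), total)

A = {BOTH, ANIMAL_ONLY}  # outcomes in which the novel has animal imagery
F = {BOTH, FLORAL_ONLY}  # outcomes in which the novel has floral imagery

p_a_given_f = P(A & F) / P(F)
print(P(A & F))     # 3/10
print(P(F))         # 1/2
print(p_a_given_f)  # 3/5
```

Under these made-up counts, three in five novels with floral imagery also contain animal imagery. Because P sums the counts of the outcomes in an event, it also computes the probability of any other element of ℘(Ω) by simple addition.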