The Biologist Is In: August 2017

Saturday, August 19, 2017

Significantly Fuzzy and Uncertain Math

I was always a very smart student, but I wasn't always a very good student. During lessons over the years, there would occasionally be little pieces that I would miss. Well, I either missed them or they simply weren't taught. One of the earliest ones was about what the point of remainders was in doing division. I don't remember a math teacher ever saying that the remainder was the numerator of a fraction and the divisor was the denominator. When the schoolwork moved past remainders, I had to basically learn the math all over again because there was no apparent connection between what we were doing and what I had been taught before. Years later I was puzzling over what the point of that early math had been and I made the connection, filling in the gap in what I was taught. If someone is trying to teach me something and I can't integrate it into the knowledge I already have, it has always been extra difficult.

In high school, I was taught about significant figures. Our pre-calculus teacher got into an argument with a student (not me) one day. She was adamant that "0 was not the same as 0.000", but she didn't explain why. I always had the hardest time keeping the rules for significant figures straight during calculations. It was only in college that I finally understood that significant figures represent the level of uncertainty in a measurement. The idea that a numerical measurement was a distinct concept from the number that described the measurement was something of a novelty to me.

Those significant figures rules?

For addition & subtraction, the last significant figure for the calculated results should be the leftmost position of the last significant figure of all the measured numbers. Only the position of the last significant figure matters. [10.0 + 1.234 ≈ 11.2]
For multiplication & division, the significant figures for the calculated result should be the same as the measured number with the least significant figures. Only the number of significant figures matters. [1.234 × 2.0 ≈ 2.5]
For a base 10 logarithm, the result should have the same number of significant figures as the starting number in scientific notation. [log₁₀(3.000×10⁴) ≈ 4.4771]
For an exponentiation, the result should have the same number of significant figures as the fractional part of the starting number in scientific notation. [10^2.07918 ≈ 120.0]
Don't round to significant figures until the entire calculation is complete.
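
As a rough illustration of the rules above, here is a minimal Python sketch that reproduces the bracketed examples. The helper name round_sig is hypothetical (it isn't from any library), and the number of significant figures or decimal places to keep is passed in by hand rather than inferred from the measurements.

```python
import math

def round_sig(value, sig_figs):
    """Round value to the given number of significant figures."""
    if value == 0:
        return 0.0
    # Position of the leading digit: 0 for 2.468, 2 for 120.0, -2 for 0.012
    exponent = math.floor(math.log10(abs(value)))
    return round(value, sig_figs - 1 - exponent)

# Addition: keep the coarsest last decimal place (10.0 is known to tenths).
print(round(10.0 + 1.234, 1))          # 11.2

# Multiplication: keep the fewest significant figures (2.0 has two).
print(round_sig(1.234 * 2.0, 2))       # 2.5

# Base-10 logarithm: 3.000 has four significant figures, so keep four decimals.
print(round(math.log10(3.000e4), 4))   # 4.4771

# Exponentiation: following the example above, keep four significant figures.
print(round_sig(10 ** 2.07918, 4))     # 120.0
```

Note that the intermediate values are left unrounded; rounding only happens once, at the end of each calculation.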

Let's see if we can convert these basic rules into something with a more statistical flavor. First we should define a way of writing uncertain numbers. Let's define an example number 'x', which has a measured value of '2' and an uncertainty of ±1. If we consider the measurement to fit the Gaussian assumption, then that uncertainty would be the standard deviation.

x = (2±1)

If we add two such measurements together, with all their uncertainty, we'd expect an average value of 4 with some unknown standard deviation.

(2±1) + (2±1) = (4±[?])

Figure illustrating how arithmetic operations are performed on intervals. A=[-1,3], B=[1,5]. Top subfigure shows A+B=[0,8]. Bottom subfigure shows A-B=[-6,2].

[from link.]

We'll need to take a step back at this point. If you go explore the topic of "fuzzy mathematics" on Wikipedia, you'll find some abstract discussion of set theory rather than something that seems like what we've been talking about here. If you do some searches for "fuzzy arithmetic", you'll get into a realm of math that is between the abstract set theory and something closer to what I'm looking for.

If you dig even further, you'll find Gaussian Fuzzy Numbers (GFN). This sounds very much like the sort of math I want. Two GFNs are added together to generate a new GFN in a two-step process. The means of the two numbers are added to make the new mean. The standard deviations are added to make the new standard deviation. In the above notation, this would be:

(2±1) + (2±1) = (4±2)

This is a pretty straightforward rule, but it doesn't feel like it has the statistical flavor that I'm looking for.

Figure illustrating a simulation of adding two normal/Gaussian distributions. Top- and middle-left subfigures show randomized distributions with a mean and standard deviation of 1. Bottom-left subfigure shows the result of adding the two distributions together, a new distribution with a mean of 2 and a standard deviation of sqrt(2). At right are two subfigures showing estimates for the distribution mean and standard deviation from numerous simulation repeats.

Method 1

How can we derive the standard deviation produced by adding two uncertain measurements? After thinking about it a bit, I thought of two methods to estimate what the value would be.

My first method basically simulates two uncertain measurements. I created a set of several thousand random samples within each initial Gaussian distribution, then iterated every possible pairwise addition between the two sets. I then calculated mean and standard deviation estimates from the set of pairwise additions. I repeated this estimation process a few thousand times and calculated the average values for the mean and standard deviation. With enough repetitions of this process, the estimates began to converge.

(2±1) + (2±1) = (3.9998±1.4146) ≈ (4±sqrt(2))
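
Here's a minimal sketch of how that kind of simulation could look in Python. It isn't the code I originally ran; the function name and defaults are made up for illustration, and the sample counts are kept far smaller than "several thousand" so it finishes in about a second.

```python
import math
import random

def simulate_sum(mean_a, sd_a, mean_b, sd_b, n_samples=200, n_repeats=20, seed=1):
    """Estimate the mean and SD of A + B from every pairwise sum of random samples."""
    rng = random.Random(seed)  # fixed seed (an arbitrary choice) for repeatability
    mean_estimates, sd_estimates = [], []
    for _ in range(n_repeats):
        a = [rng.gauss(mean_a, sd_a) for _ in range(n_samples)]
        b = [rng.gauss(mean_b, sd_b) for _ in range(n_samples)]
        # Every possible pairwise addition between the two sample sets.
        sums = [x + y for x in a for y in b]
        mean = sum(sums) / len(sums)
        variance = sum((s - mean) ** 2 for s in sums) / len(sums)
        mean_estimates.append(mean)
        sd_estimates.append(math.sqrt(variance))
    # Average the per-repeat estimates.
    return sum(mean_estimates) / n_repeats, sum(sd_estimates) / n_repeats

mean_est, sd_est = simulate_sum(2, 1, 2, 1)
print(f"(2±1) + (2±1) ≈ ({mean_est:.3f}±{sd_est:.3f})")
```

With more samples and more repeats, the estimates should settle closer to 4 and sqrt(2).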

A figure showing an alternate method of deriving the result of adding together two gaussian distributions. Top and middle subfigure show a blue gaussian curve with a mean and standard deviation of 1. Bottom subfigure shows the result of adding every point from the first distribution/curve to every point of the second. The envelope, the upper bounds of the resulting set of points makes a new gaussian curve with a mean of 2 and a standard deviation of sqrt(2).

Method 2

That approach to estimating the new standard deviation takes a lot of calculations. My second method is much more efficient and converges faster. I started with two Gaussian curves, sampled at some high density. I then iterated through every combination of one point from the first curve and one point from the second. For each combination, the two x-values were added to make a new x-value. The two y-values were multiplied to make a new y-value. (The y-values are probabilities. Multiplying the two probabilities calculates the probability for both happening at once.) Plot all those x/y value pairs (in light blue at left) and the envelope (or outline, roughly) of those points (shown in red) describes the same curve we estimated more roughly with my first method. I fitted the Gaussian distribution function to this curve to get the numerical estimate for its standard deviation.

(1±1) + (1±1) = (2±1.4142) ≈ (2±sqrt(2))
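
A rough Python sketch of this envelope approach follows. The names are hypothetical, and one step is swapped out: instead of fitting the Gaussian function to the envelope, the sketch reads the mean and standard deviation off the envelope's weighted moments, which gives the same answer when the envelope really is Gaussian-shaped.

```python
import math

def gaussian(x, mean, sd):
    """Gaussian probability density at x."""
    return math.exp(-0.5 * ((x - mean) / sd) ** 2) / (sd * math.sqrt(2 * math.pi))

def envelope_sum(mean_a, sd_a, mean_b, sd_b, step=0.05, width=6.0):
    """Mean and SD of the envelope built from every pair of curve points."""
    # Sample both curves on grids with the same spacing, out to +/- width SDs.
    n_a = int(round(2 * width * sd_a / step)) + 1
    n_b = int(round(2 * width * sd_b / step)) + 1
    lo_a = mean_a - width * sd_a
    lo_b = mean_b - width * sd_b
    ya = [gaussian(lo_a + i * step, mean_a, sd_a) for i in range(n_a)]
    yb = [gaussian(lo_b + j * step, mean_b, sd_b) for j in range(n_b)]

    # Adding x-values is the same as adding grid indices, so envelope[i + j]
    # keeps the largest product of y-values seen for that summed x-value.
    envelope = [0.0] * (n_a + n_b - 1)
    for i, p in enumerate(ya):
        for j, q in enumerate(yb):
            if p * q > envelope[i + j]:
                envelope[i + j] = p * q

    # Stand-in for the curve fit: weighted mean and SD of the envelope.
    xs = [lo_a + lo_b + k * step for k in range(len(envelope))]
    total = sum(envelope)
    mean = sum(x * y for x, y in zip(xs, envelope)) / total
    variance = sum((x - mean) ** 2 * y for x, y in zip(xs, envelope)) / total
    return mean, math.sqrt(variance)

env_mean, env_sd = envelope_sum(1, 1, 1, 1)
print(f"(1±1) + (1±1) ≈ ({env_mean:.4f}±{env_sd:.4f})")
```

There is no randomness here, so a single pass over the grid is enough; the step size and grid width are the only knobs.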

Table from math textbook, showing specific calculations for addition/subtraction, multiplication, division, power, multiplication by a constant, and a generalized function of gaussians.

That seems a nice and simple relationship, but it is distinctly different from what the Gaussian Fuzzy Number calculation described previously would indicate. It took some further digging before I found a document on the topic of "propagation of uncertainties". The document included a nice table with a series of very useful relationships, describing how Gaussian uncertainties are combined by various basic mathematical operations.

From these relationships, we can short-circuit around all the iterative calculations I've been playing with. If we have measurements with a non-Gaussian distribution, it might still be necessary to use the numerical estimation methods I came up with.
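
For the most common operations, those relationships can be sketched in a few lines of Python. These are the standard first-order formulas for independent Gaussian uncertainties (absolute uncertainties add in quadrature for sums and differences, relative uncertainties add in quadrature for products and quotients); the function names are only for illustration.

```python
import math

def add(value_a, sd_a, value_b, sd_b):
    """(a ± sd_a) + (b ± sd_b): absolute uncertainties add in quadrature."""
    return value_a + value_b, math.hypot(sd_a, sd_b)

def subtract(value_a, sd_a, value_b, sd_b):
    """(a ± sd_a) - (b ± sd_b): the uncertainty is the same as for addition."""
    return value_a - value_b, math.hypot(sd_a, sd_b)

def multiply(value_a, sd_a, value_b, sd_b):
    """(a ± sd_a) * (b ± sd_b): relative uncertainties add in quadrature."""
    value = value_a * value_b
    relative = math.hypot(sd_a / value_a, sd_b / value_b)
    return value, abs(value) * relative

def divide(value_a, sd_a, value_b, sd_b):
    """(a ± sd_a) / (b ± sd_b): same relative rule as multiplication."""
    value = value_a / value_b
    relative = math.hypot(sd_a / value_a, sd_b / value_b)
    return value, abs(value) * relative

total, total_sd = add(2, 1, 2, 1)
print(f"(2±1) + (2±1) = ({total}±{total_sd:.4f})")
```

The multiplication and division forms assume nonzero values and uncertainties that are small relative to those values.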

Figure illustrating addition of two Gaussians by three different methods. Shows how significant figures calculations underestimate the expected resulting variation and how Gaussian fuzzy number calculations overestimate the expected resulting variation. Propagation of uncertainty calculations match the expectations from earlier simulation methods.

Let's compare the three methods for tracking uncertainty through calculations.

Significant figures: (1±0.5) + (1±0.5) = (2±0.5)
Gaussian fuzzy numbers: (1±0.5) + (1±0.5) = (2±1.0)
Propagation of uncertainties: (1±0.5) + (1±0.5) = (2±0.70711)

The significant figures method underestimates the uncertainty through the calculation, while the Gaussian fuzzy numbers approach overestimates the uncertainty. Both these methods do have the advantage of being simple to apply without requiring any detailed computation. However, the errors would probably accumulate through more extensive calculations. I'll have to play around with a few test cases later to illustrate this.
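
As a first, hedged look at how the three approaches drift apart, here is a small Python sketch that sums several measurements with the same uncertainty. It assumes the significant figures rule keeps the uncertainty fixed at the last significant position (extending the two-term comparison above), that Gaussian fuzzy numbers add standard deviations directly, and that propagation of uncertainties adds them in quadrature. The function name is hypothetical.

```python
import math

def accumulated_uncertainty(sd, n_terms):
    """Uncertainty on the sum of n_terms measurements that each carry the same sd."""
    return {
        # Assumption: the last significant position never moves.
        "significant figures": sd,
        # Standard deviations add directly.
        "Gaussian fuzzy numbers": sd * n_terms,
        # Standard deviations add in quadrature.
        "propagation of uncertainties": sd * math.sqrt(n_terms),
    }

for n in (2, 10, 100):
    results = accumulated_uncertainty(0.5, n)
    summary = ", ".join(f"{name}: ±{value:.3f}" for name, value in results.items())
    print(f"{n} terms -> {summary}")
```

Under these assumptions the gap widens with every added term: the significant figures estimate stays flat, the fuzzy number estimate grows in proportion to the number of terms, and the propagated uncertainty grows only with its square root.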

I didn't like significant figures when I was first taught about them. The rules struck me as somewhat arbitrary and the results didn't fit at all with my expectations of how numbers should behave. The lessons were always a stumbling point for me because of this disconnect.

Over the years since, I had occasionally played around with how to do it better. It was only recently that I figured out how to derive the solutions I described above and realized propagation of uncertainties was what I had been searching for. Those high-school lessons would have been so much more effective had they included the real math instead of assuming I couldn't handle the concepts.

References:

Significant figures: https://en.wikipedia.org/wiki/Significant_figures#Concise_rules
Fuzzy mathematics: en.wikipedia.org/wiki/Fuzzy_mathematics
Fuzzy arithmetic:
Calculating uncertainty: www.wikihow.com/Calculate-Uncertainty
Propagation of uncertainties: virgo-physics.sas.upenn.edu/uglabs/lab_manual/Error_Analysis.pdf

Tuesday, August 1, 2017

A Cross by Any Other Name

Figure illustrating how a recessive trait appears in F1, F2, and F3 generations after a cross. In F1, the trait is hidden. In F2, a quarter of individuals show the recessive trait. In F3, 3/16 of individuals show the recessive trait.

From [link].

I've been involved in a few discussions online lately about different types of crosses that can be used in plant breeding. There has been some mild confusion about basic terms, as well as about the implications of different types of crosses. A few years ago I wrote about backcrossing. Though that post is somewhat hard for me to read, as I imagine early writings are for most authors, it has some useful information. Here I'm going to try to give a more general overview. Let's see how this little ride goes.

Some of that basic terminology and common abbreviations:

P : Parental. An initial variety used in a cross. Multiple parents can be numbered, like in "p1 x p2".
F : Filial, relating to progeny generations after an initial cross. F1 is the initial hybrid. F2 is the result of crossing two F1s. F3 is the result of crossing two F2s, etc.
Self Cross : Crossing the male and female parts of the same plant.
BC : Back cross. Crossing a filial generation back to one of the parents.
CC : Complex cross. A cross involving more than two parents.

P : To simplify things, we usually use highly stable varieties as initial parents in a hybridization project. This means that several generations of each parent variety have been grown out without any visible variation appearing. At the basic genomic level, this means the varieties are highly homozygous. In theoretical cases we consider the parents to be absolutely homozygous, though reality is never quite so clear-cut.

F1 : Our initial hybrid between two parents can be written out in a bit longer form like "p1 x p2", or just referred to as an F1 between the two parents. In our idealized scenario, every F1 produced by crossing the same two parents will be identical. F1 stands for "first filial generation".

If a group of F1s aren't identical, this says one or both of the parents wasn't entirely homozygous. (Or new mutations were introduced, or epigenetic effects are at play, etc. It can get complicated.) Because they're (more or less) identical, selection usually isn't very important at this stage.

From [link].

F2 : Our second filial generation is produced by crossing two F1s together. For those plants that can self cross (like peppers and tomatoes), the F2s would generally be produced by crossing one F1 to itself. For those that can't (like tomatillos), the F2s would be produced by crossing two separate F1 siblings.

The F2 generation is where the different alleles from each parent are recombined. Almost any combination of traits from each parent can turn up in an individual among the F2s. This is where the magic in a plant breeding project really happens. This generation is where selection is most important.

F3...Fn : Subsequent filial generations would be produced in a similar way to the F2s. If you produced F3s by selfing an F2, each F3 will have about 50% of the heterozygosity of the F2. Selfing another generation will result in another 50% loss of heterozygosity. Continue this process for enough generations and you will have a new stable variety, with an essentially homozygous genome.
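
To put rough numbers on that halving, here is a minimal Python sketch. It assumes an idealized case: the F1 is heterozygous at every locus where the two parents differ, every generation after that is produced by selfing, and there is no selection. The function name is made up for illustration.

```python
from fractions import Fraction

def heterozygosity_after_selfing(generation):
    """Expected fraction of the F1's heterozygous loci still heterozygous in Fn."""
    if generation < 1:
        raise ValueError("generation must be 1 (F1) or higher")
    # Each round of selfing halves the expected heterozygosity.
    return Fraction(1, 2) ** (generation - 1)

for n in range(1, 9):
    remaining = heterozygosity_after_selfing(n)
    print(f"F{n}: {float(remaining):.2%} of the F1's heterozygosity remains")
```

By this estimate an F8 would be expected to keep less than 1% of the heterozygosity of the F1, though any individual plant can land above or below the average.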

If you produced F3s by crossing random F2s, you'll keep mixing up the genetics instead of automatically losing 50% of the heterozygosity each generation. If you do this with relatively few plants, you will still be losing heterozygosity each generation, though calculating exactly how much becomes a bit complicated.

If you produced F3s by crossing specific F2s that had a trait you liked, you'll keep mixing up all the other genetics while selecting for that specific trait. You would be losing heterozygosity near the genes responsible for the trait of interest, but the rest of the genome would still be maintaining heterozygosity through generations.

BC : In basic back crossing, each subsequent generation past F1 is crossed back to one of the parents. BC1 would be diagrammed something like, "[p1 x p2] x p1" (or "F1 x p1"). For one hypothetical mutation found in the first parent, a BC1 individual would have a 50% chance of having two copies (and a 0% chance of having no copies) since it is assured of inheriting one copy from the parental strain used in the backcross.

Through each generation of back-crossing the resulting plants will lose 50% of their heterozygosity, but it will be replaced with whatever mutations are found in the parental strain. The result will end up more and more like the recurrent parent strain over the generations. If you do this randomly, you will end up with essentially a genetic clone of the recurrent parent. To get anything different, you have to persistently select for a trait that was originally only in the second parental variety. Doing this will eventually produce something almost exactly like the recurrent parent, but with the one trait that was originally in the other parent variety. (That's all detailed in the link I mentioned in the intro.)
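
The backcross arithmetic can be sketched in Python as well. This is a simplified single-gene model with hypothetical helper names: "m" stands for the mutation carried by the recurrent parent and "+" for the other parent's allele, parents are treated as fully homozygous, and no selection is applied.

```python
from collections import Counter
from fractions import Fraction
from itertools import product

def cross(parent_a, parent_b):
    """Offspring genotype probabilities for one gene, e.g. cross("+m", "mm")."""
    # Pair every allele from one parent with every allele from the other.
    counts = Counter("".join(sorted(pair)) for pair in product(parent_a, parent_b))
    total = sum(counts.values())
    return {genotype: Fraction(n, total) for genotype, n in counts.items()}

def recurrent_parent_fraction(backcross_generation):
    """Expected share of the genome from the recurrent parent in BCn (F1 is BC0)."""
    return 1 - Fraction(1, 2) ** (backcross_generation + 1)

f1 = cross("mm", "++")    # every F1 is "+m"
bc1 = cross("+m", "mm")   # F1 crossed back to the first parent
print("F1:", f1)
print("BC1:", bc1)

for n in range(0, 6):
    print(f"BC{n}: {float(recurrent_parent_fraction(n)):.1%} recurrent parent")
```

The BC1 result has no "++" genotype at all, which matches the 0% chance of having no copies. The genome-wide percentages are expectations only; persistent selection for a trait from the other parent keeps the region around that trait from converging.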

CC : A complex cross involves three or more parental varieties. A simple case would be taking an F1 and crossing it to an independent F1, "[p1 x p2] x [p3 x p4]". In these scenarios you would get a very diverse population, just like with F2s, but the mutations contributed to the population can come from all four parent varieties.

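A mutation that was found in only one of the (homozygous) parental strains would be carried by every F1 from that parent, and each F1 passes it to half of its offspring. It would therefore be found in one copy in 50% of this mixed up population, making up 25% of the alleles for that gene. If one randomly chosen plant from this population was selfed, the chance of a plant being homozygous for the mutation in the next generation is 12.5% (a 50% chance of having picked a carrier, times the 25% of a carrier's selfed offspring that are homozygous).
If the plants were allowed to cross randomly, the chance of a plant being homozygous in the next generation drops to only 6.25% (25% × 25%). You would need to be working with large numbers of plants to routinely recover double-recessives using this strategy. I strongly advise you not use this strategy.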

References:

Genetics: sites.google.com/a/wisc.edu/ils202fall11/home/student-wikis/group8
Back cross: the-biologist-is-in.blogspot.com/2014/03/the-genetics-of-backcrossing.html
Cross types: agriinfo.in/default.aspx?page=topic&superid=3&topicid=1753
Monohybrid cross: en.wikipedia.org/wiki/Monohybrid_cross

Dihybrid cross: