I'm trying to write exercises in which students eyeball the equation of a best fit line to a set of data.

Their answers will vary, so the answer checker needs to allow tolerance.

I want the student to enter a linear equation in x and y (not in a specific form) as an answer.

I tried using ImplicitPlane to define the equation,

$N = [-$m,1,0];

$ans = ImplicitPlane($N,$b);

ANS($ans->cmp(tolType=>'relative', tolerance=>1));

But when $m was 19.8957 and $b=121, a submitted answer of y=19.9x + 121 was marked incorrect with the comment that it was equivalent to the (correct) one I had previously submitted.

I might fall back on asking separately for m and b, rather than for a single equation.
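
A minimal sketch of that fallback, assuming $m and $b are already defined earlier in the problem. It is a PG fragment that only runs inside a WeBWorK problem, and the tolerance values are placeholders, not recommendations:

```perl
# PG fragment: ask for the slope and intercept separately.
# Assumes $m and $b (the reference slope and intercept) are defined above.
Context("Numeric");

BEGIN_TEXT
slope: m = \{ ans_rule(10) \}
$BR
y-intercept: b = \{ ans_rule(10) \}
END_TEXT

# Placeholder tolerances; choose them to suit the scatter in the data.
ANS(Real($m)->cmp(tolType => 'absolute', tolerance => 0.5));
ANS(Real($b)->cmp(tolType => 'absolute', tolerance => 5));
```

Each answer then gets its own tolerance, at the cost of forcing slope-intercept form on the student.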

### estimating equation of a line

by Bruce Yoshiwara -
In reply to Bruce Yoshiwara
Monday, 3 January 2011, 1:23 PM

### Re: estimating equation of a line

by Paul Pearson -
Hi Bruce,

Finding an equation for a line that best fits data by eye is very subjective. The easy thing to try is exactly what you did: change the tolerances. However, statisticians may not like this easy method. Your question asks students to fit a line to the data, but your answer checker may actually be evaluating how well the student's answer extrapolates. What I mean is, if your data points (x,y) all have x-coordinates between 100 and 110, but the answer checker evaluates the student's function at test points in the default interval -2 <= x <= 2, then you're really checking whether the student came close to the y-intercept of the best fit line. To correct this, set the limits for answer evaluation to 100 <= x <= 110. Here's the reference for how to do this:

http://webwork.maa.org/wiki/FormulaTestPoints
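
As a rough sketch of what that page describes, the test-point limits can be set either for the variable in the context or for a single formula. This is a PG fragment that only runs inside a WeBWorK problem; $m, $b, and the tolerance value are placeholders for illustration:

```perl
# PG fragment (needs MathObjects.pl loaded in the problem).
Context("Numeric");

# Option 1: every formula in x in this context is tested on [100,110].
Context()->variables->set(x => { limits => [100, 110] });

# Option 2: override the limits for one formula only.
# $m and $b are placeholders for the slope and intercept.
$f = Formula("$m x + $b");
$f->{limits} = [100, 110];

# The tolerance here is illustrative, not a recommendation.
ANS($f->cmp(tolType => 'relative', tolerance => 0.05));
```

Either option alone is enough; the per-formula setting takes precedence for that formula.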

Statisticians would probably prefer an approach that uses a linear equation that best fits the data using a least squares method, and then compares the variance of the best fit line to the variance of the student answer. (I'm not a statistician, so anyone please correct me if I'm using incorrect terminology or a faulty method.) Such an approach is given below. If your data sets are large, let me know and I'll email you a version that uses plain Perl functions (instead of MathObject functions) and should be much faster.
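
To make the arithmetic concrete, here is the least squares computation worked by hand for the sample data (1,3), (2,6), (3,7), (4,9) used in the problem. The last line is the quantity the code calls the variance, which is really the square root of the residual sum of squares divided by n-2:

```latex
\begin{aligned}
n &= 4, \quad \sum x_i = 10, \quad \sum x_i^2 = 30, \quad \sum y_i = 25, \quad \sum x_i y_i = 72,\\
\det &= n \sum x_i^2 - \Bigl(\sum x_i\Bigr)^2 = 4 \cdot 30 - 10^2 = 20,\\
b &= \frac{\sum x_i^2 \sum y_i - \sum x_i \sum x_i y_i}{\det} = \frac{750 - 720}{20} = 1.5,\\
m &= \frac{n \sum x_i y_i - \sum x_i \sum y_i}{\det} = \frac{288 - 250}{20} = 1.9,\\
e_i &= y_i - (b + m x_i): \quad -0.4,\ 0.7,\ -0.2,\ -0.1,\\
\sqrt{\frac{\sum e_i^2}{n-2}} &= \sqrt{\frac{0.70}{2}} \approx 0.59.
\end{aligned}
```

For comparison, a student answer of y = 2x + 1.25 has residuals -0.25, 0.75, -0.25, -0.25, so the same quantity is sqrt(0.75/2), about 0.61. That is within 0.1 of 0.59, so a checker using that threshold would accept it.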

Best Regards,

Paul Pearson

##################

# Initialization

DOCUMENT();

loadMacros(

"PGstandard.pl",

"MathObjects.pl",

);

TEXT(beginproblem());

##################

# Setup

Context("Numeric");

@x = (1,2,3,4);

@y = (3,6,7,9);

sub least_squares_linear_fit {

# dereference the two array references passed in

my @x = @{ shift() };

my @y = @{ shift() };

# From equation 10 at http://mathworld.wolfram.com/LeastSquaresFitting.html

#

# The best fit line b + m x for n data points (x_i,y_i) satisfies the matrix equation

#

# [b]   [ n        Sum x_i   ]^{-1} [ Sum y_i     ]

# [m] = [ Sum x_i  Sum x_i^2 ]      [ Sum x_i*y_i ]

my $n = scalar(@x);

my $n1 = $n - 1;

my $sumx = 0; foreach my $i (0..$n1) { $sumx += $x[$i]; }

my $sumxx = 0; foreach my $i (0..$n1) { $sumxx += ($x[$i])**2; }

# The vector on the right

my $sumy = 0; foreach my $i (0..$n1) { $sumy += $y[$i]; }

my $sumxy = 0; foreach my $i (0..$n1) { $sumxy += ($x[$i])*($y[$i]); }

my $det = ($n)*($sumxx)-($sumx)**2;

my $b = 1/$det * ($sumxx * $sumy -($sumx) * $sumxy);

my $m = 1/$det * (-($sumx) * $sumy + $n * $sumxy);

# correlation coefficient from http://mathbits.com/mathbits/tisection/statistics2/correlation.htm

my $sumyy = 0; foreach my $i (0..$n1) { $sumyy += ($y[$i])*($y[$i]); }

my $correlation = ($n * ($sumxy) -($sumx * $sumy) ) / ( sqrt($n * $sumxx -($sumx)**2) * sqrt($n * $sumyy -($sumy)**2) );

# from equation 32 at http://mathworld.wolfram.com/LeastSquaresFitting.html

my @e = (); foreach my $i (0..$n-1) { $e[$i] = $y[$i] -($b + $m * $x[$i]); } # error

my $sumee = 0; foreach my $i (0..$n-1) { $sumee += ($e[$i])**2; }

my $variance = sqrt($sumee / ($n-2));

return ($m,$b,$correlation,$variance);

}

@fit = least_squares_linear_fit( \@x, \@y);

$f = Formula("$fit[0] * x + $fit[1]")->reduce;

$cor = Real("$fit[2]");

$var = Real("$fit[3]");

#########################

# Main text

Context()->texStrings;

BEGIN_TEXT

The least squares best linear fit for

\( ($x[0],$y[0]) \),

\( ($x[1],$y[1]) \),

\( ($x[2],$y[2]) \),

\( ($x[3],$y[3]) \)

is \( y = $f \) with correlation \( $cor \)

and variance \( $var \).

$PAR

Enter your own linear approximation.

$BR

y = \{ ans_rule(20) \}

END_TEXT

Context()->normalStrings;

#########################

# Answer evaluation

$showPartialCorrectAnswers = 1;

ANS( $f->cmp( checker=>sub {

my ( $correct, $student, $ansHash ) = @_;

my @ystu = ();

foreach my $i (0..$#x) {

$ystu[$i] = $student->eval(x=>$x[$i]);

}

my $n = scalar(@x);

my $n1 = $n - 1;

# student error: array of height differences between data point and student's line

my @estu = (); foreach my $i (0..$n1) { $estu[$i] = $y[$i] -($student->eval(x=>$x[$i])); }

# sum of the squares of the student error

my $sumeestu = 0; foreach my $i (0..$n1) { $sumeestu += ($estu[$i])**2; }

# a variance calculation for the student's answer

my $varstu = sqrt($sumeestu / ($n-2));

# compare the variance $var of the best fit line to the variance of the student's answer $varstu

if ( abs($var-$varstu) < 0.1 ) { return 1; } else { return 0; }

})

);

ENDDOCUMENT();
