Использование MapReduce для подбора модели логистической регрессии

Попробовать в MATLAB

В этом примере показано, как использовать mapreduce проведение простой логистической регрессии с использованием одного предиктора. Он демонстрирует цепочку нескольких mapreduce вызывается для выполнения итерационного алгоритма. Поскольку каждая итерация требует отдельного прохода через данные, анонимная функция передает информацию от одной итерации к следующей, чтобы передать информацию непосредственно в преобразователь.

Подготовка данных

Создайте datastore с помощью airlinesmall.csv набор данных. Этот 12-мегабайтный набор данных содержит 29 столбцов информации о рейсе для нескольких авиаперевозчиков, включая время прибытия и вылета. В этом примере интересующие вас переменные ArrDelay(задержка прибытия рейса) и Distance (общая дальность рейса).

ds = tabularTextDatastore('airlinesmall.csv', 'TreatAsMissing', 'NA');
ds.SelectedVariableNames = {'ArrDelay', 'Distance'};

datastore лечит 'NA' значения как отсутствующие и заменяет отсутствующие значения на NaN значения по умолчанию. Кроме того, SelectedVariableNames свойство позволяет работать только с заданными интересующими переменными, которые можно проверить используя preview.

preview(ds)

ans=8×2 table
    ArrDelay    Distance
    ________    ________

        8         308   
        8         296   
       21         480   
       13         296   
        4         373   
       59         308   
        3         447   
       11         954

Выполните логистическую регрессию

Логистическая регрессия является способом моделирования вероятности события как функции другой переменной. В этом примере логистическая регрессия моделирует вероятность того, что рейс опоздает более чем на 20 минут в зависимости от расстояния полета, в тысячах миль.

Чтобы выполнить эту логистическую регрессию, функции map и reduce должны коллективно выполнить взвешенную регрессию методом наименьших квадратов на основе значений текущих коэффициентов. mapper вычисляет взвешенную сумму квадратов и перекрестного продукта для каждого блока входных данных.

Отобразите файл функции map.

function logitMapper(b,t,~,intermKVStore)
  % Get data input table and remove any rows with missing values
  y = t.ArrDelay;
  x = t.Distance;
  t = ~isnan(x) & ~isnan(y);
  y = y(t)>20;                 % late by more than 20 min
  x = x(t)/1000;               % distance in thousands of miles

  % Compute the linear combination of the predictors, and the estimated mean
  % probabilities, based on the coefficients from the previous iteration
  if ~isempty(b)
    % Compute xb as the linear combination using the current coefficient
    % values, and derive mean probabilities mu from them
    xb = b(1)+b(2)*x;
    mu = 1./(1+exp(-xb));
  else
    % This is the first iteration. Compute starting values for mu that are
    % 1/4 if y=0 and 3/4 if y=1. Derive xb values from them.
    mu = (y+.5)/2;
    xb = log(mu./(1-mu)); 
  end

  % To perform weighted least squares, compute a sum of squares and cross
  % products matrix:
  %      (X'*W*X) = (X1'*W1*X1) + (X2'*W2*X2) + ... + (Xn'*Wn*Xn),
  % where X = [X1;X2;...;Xn]  and  W = [W1;W2;...;Wn].
  %
  % The mapper receives one chunk at a time and computes one of the terms on
  % the right hand side. The reducer adds all of the terms to get the
  % quantity on the left hand side, and then performs the regression.
  w = (mu.*(1-mu));                  % weights
  z = xb + (y - mu) .* 1./w;         % adjusted response

  X = [ones(size(x)),x,z];           % matrix of unweighted data
  wss = X' * bsxfun(@times,w,X);     % weighted cross-products X1'*W1*X1

  % Store the results for this part of the data.
  add(intermKVStore, 'key', wss);
end

Редуктор вычисляет оценки коэффициентов регрессии из сумм квадратов и перекрестных продуктов.

Отобразите файл функции сокращения.

function logitReducer(~,intermValIter,outKVStore)
  % We will operate over chunks of the data, updating the count, mean, and
  % covariance each time we add a new chunk
  old = 0;

  % We want to perform weighted least squares. We do this by computing a sum
  % of squares and cross products matrix
  %      M = (X'*W*X) = (X1'*W1*X1) + (X2'*W2*X2) + ... + (Xn'*Wn*Xn)
  % where X = X1;X2;...;Xn]  and  W = [W1;W2;...;Wn].
  %
  % The mapper has computed the terms on the right hand side. Here in the
  % reducer we just add them up.

  while hasnext(intermValIter)
    new = getnext(intermValIter);
    old = old+new;
  end
  M = old;  % the value on the left hand side

  % Compute coefficients estimates from M. M is a matrix of sums of squares
  % and cross products for [X Y] where X is the design matrix including a
  % constant term and Y is the adjusted response for this iteration. In other
  % words, Y has been included as an additional column of X. First we
  % separate them by extracting the X'*W*X part and the X'*W*Y part.
  XtWX = M(1:end-1,1:end-1);
  XtWY = M(1:end-1,end);

  % Solve the normal equations.
  b = XtWX\XtWY;

  % Return the vector of coefficient estimates.
  add(outKVStore, 'key', b);
end

Запуск MapReduce

Выполняйте mapreduce итеративно путем заключения вызовов в mapreduce в цикле. Цикл запускается до тех пор, пока не будут достигнуты критерии сходимости, с максимум пятью итерациями.

% Define the coefficient vector, starting as empty for the first iteration.
b = []; 

for iteration = 1:5
    b_old = b;
    iteration
    
    % Here we will use an anonymous function as our mapper. This function
    % definition includes the value of b computed in the previous
    % iteration.
    mapper = @(t,ignore,intermKVStore) logitMapper(b,t,ignore,intermKVStore);
    result = mapreduce(ds, mapper, @logitReducer, 'Display', 'off');
    
    tbl = readall(result);
    b = tbl.Value{1}
    
    % Stop iterating if we have converged.
    if ~isempty(b_old) && ...
       ~any(abs(b-b_old) > 1e-6 * abs(b_old))
       break
    end
end

iteration = 1

iteration = 2

iteration = 3

iteration = 4

Посмотреть Результаты

Используйте полученные оценки коэффициентов регрессии, чтобы построить график кривой вероятностей. Эта кривая показывает вероятность опоздания рейса более чем на 20 минут как функцию от рейса расстояния.

xx = linspace(0,4000);
yy = 1./(1+exp(-b(1)-b(2)*(xx/1000)));
plot(xx,yy); 
xlabel('Distance');
ylabel('Prob[Delay>20]')

Figure contains an axes. The axes contains an object of type line.

Локальные функции

Вот карта и сокращение функций, которые mapreduce применяется к данным.

function logitMapper(b,t,~,intermKVStore)
  % Get data input table and remove any rows with missing values
  y = t.ArrDelay;
  x = t.Distance;
  t = ~isnan(x) & ~isnan(y);
  y = y(t)>20;                 % late by more than 20 min
  x = x(t)/1000;               % distance in thousands of miles

  % Compute the linear combination of the predictors, and the estimated mean
  % probabilities, based on the coefficients from the previous iteration
  if ~isempty(b)
    % Compute xb as the linear combination using the current coefficient
    % values, and derive mean probabilities mu from them
    xb = b(1)+b(2)*x;
    mu = 1./(1+exp(-xb));
  else
    % This is the first iteration. Compute starting values for mu that are
    % 1/4 if y=0 and 3/4 if y=1. Derive xb values from them.
    mu = (y+.5)/2;
    xb = log(mu./(1-mu)); 
  end

  % To perform weighted least squares, compute a sum of squares and cross
  % products matrix:
  %      (X'*W*X) = (X1'*W1*X1) + (X2'*W2*X2) + ... + (Xn'*Wn*Xn),
  % where X = [X1;X2;...;Xn]  and  W = [W1;W2;...;Wn].
  %
  % The mapper receives one chunk at a time and computes one of the terms on
  % the right hand side. The reducer adds all of the terms to get the
  % quantity on the left hand side, and then performs the regression.
  w = (mu.*(1-mu));                  % weights
  z = xb + (y - mu) .* 1./w;         % adjusted response

  X = [ones(size(x)),x,z];           % matrix of unweighted data
  wss = X' * bsxfun(@times,w,X);     % weighted cross-products X1'*W1*X1

  % Store the results for this part of the data.
  add(intermKVStore, 'key', wss);
end
%-----------------------------------------------------------------------------
function logitReducer(~,intermValIter,outKVStore)
  % We will operate over chunks of the data, updating the count, mean, and
  % covariance each time we add a new chunk
  old = 0;

  % We want to perform weighted least squares. We do this by computing a sum
  % of squares and cross products matrix
  %      M = (X'*W*X) = (X1'*W1*X1) + (X2'*W2*X2) + ... + (Xn'*Wn*Xn)
  % where X = X1;X2;...;Xn]  and  W = [W1;W2;...;Wn].
  %
  % The mapper has computed the terms on the right hand side. Here in the
  % reducer we just add them up.

  while hasnext(intermValIter)
    new = getnext(intermValIter);
    old = old+new;
  end
  M = old;  % the value on the left hand side

  % Compute coefficients estimates from M. M is a matrix of sums of squares
  % and cross products for [X Y] where X is the design matrix including a
  % constant term and Y is the adjusted response for this iteration. In other
  % words, Y has been included as an additional column of X. First we
  % separate them by extracting the X'*W*X part and the X'*W*Y part.
  XtWX = M(1:end-1,1:end-1);
  XtWY = M(1:end-1,end);

  % Solve the normal equations.
  b = XtWX\XtWY;

  % Return the vector of coefficient estimates.
  add(outKVStore, 'key', b);
end
%-----------------------------------------------------------------------------

См. также

mapreduce | tabularTextDatastore

Документация MATLAB

Поддержка

Сообщество Экспонента

Документация