Prepare Text Data for Analysis

This example shows how to create a function that cleans and preprocesses text data for analysis.

Text data can be large and can contain a lot of noise that negatively affects statistical analysis. For example, text data can contain the following:

Variations in case, for example "new" and "New"
Variations in word forms, for example "walk" and "walking"
Words that add noise, for example stop words such as "and" and "to"
Punctuation and special characters
HTML and XML tags
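
As a minimal sketch of two of these noise sources, the snippet below assumes a made-up string containing an HTML tag (it is not taken from weatherReports.csv) and uses eraseTags from Text Analytics Toolbox together with the base MATLAB function lower. It is illustrative only and is not a step of the workflow in this example.

```matlab
% Hypothetical input with an HTML tag and mixed case (not from the example data).
str = "<b>New</b> trees down near the new road.";

% Strip HTML and XML tags, then convert to lowercase so that
% "New" and "new" are treated as the same word.
str = eraseTags(str);
str = lower(str)
```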

These word clouds illustrate word frequency analysis applied to some raw text data from weather reports, and to a preprocessed version of the same text data.

Load and Extract Text Data

Load the example data. The file weatherReports.csv contains weather reports, including a text description and categorical labels for each event.

filename = "weatherReports.csv";
data = readtable(filename,'TextType','string');

Extract the text data from the field event_narrative, and the label data from the field event_type.

textData = data.event_narrative;
labels = data.event_type;
textData(1:10)

ans = 10×1 string array
    "Large tree down between Plantersville and Nettleton."
    "One to two feet of deep standing water developed on a street on the Winthrop University campus after more than an inch of rain fell in less than an hour. One vehicle was stalled in the water."
    "NWS Columbia relayed a report of trees blown down along Tom Hall St."
    "Media reported two trees blown down along I-40 in the Old Fort area."
    ""
    "A few tree limbs greater than 6 inches down on HWY 18 in Roseland."
    "Awning blown off a building on Lamar Avenue. Multiple trees down near the intersection of Winchester and Perkins."
    "Quarter size hail near Rosemark."
    "Tin roof ripped off house on Old Memphis Road near Billings Drive. Several large trees down in the area."
    "Powerlines down at Walnut Grove and Cherry Lane roads."

Create Tokenized Documents

Create an array of tokenized documents.

cleanedDocuments = tokenizedDocument(textData);
cleanedDocuments(1:10)

ans = 
  10×1 tokenizedDocument:

     8 tokens: Large tree down between Plantersville and Nettleton .
    39 tokens: One to two feet of deep standing water developed on a street on the Winthrop University campus after more than an inch of rain fell in less than an hour . One vehicle was stalled in the water .
    14 tokens: NWS Columbia relayed a report of trees blown down along Tom Hall St .
    14 tokens: Media reported two trees blown down along I-40 in the Old Fort area .
     0 tokens:
    15 tokens: A few tree limbs greater than 6 inches down on HWY 18 in Roseland .
    20 tokens: Awning blown off a building on Lamar Avenue . Multiple trees down near the intersection of Winchester and Perkins .
     6 tokens: Quarter size hail near Rosemark .
    21 tokens: Tin roof ripped off house on Old Memphis Road near Billings Drive . Several large trees down in the area .
    10 tokens: Powerlines down at Walnut Grove and Cherry Lane roads .
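
To check token counts programmatically instead of reading them from the display, one option is doclength. The following is a minimal sketch on the first narrative shown above.

```matlab
% First narrative from the displayed data.
str = "Large tree down between Plantersville and Nettleton.";
documents = tokenizedDocument(str);

% Number of tokens in each document. The final period counts as a token.
numTokens = doclength(documents)
```

The result should agree with the 8 tokens reported for this document in the display above.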

Lemmatize the words using normalizeWords. To improve lemmatization, first add part-of-speech details to the documents using addPartOfSpeechDetails.

cleanedDocuments = addPartOfSpeechDetails(cleanedDocuments);
cleanedDocuments = normalizeWords(cleanedDocuments,'Style','lemma');
cleanedDocuments(1:10)

ans = 
  10×1 tokenizedDocument:

     8 tokens: large tree down between plantersville and nettleton .
    39 tokens: one to two foot of deep standing water develop on a street on the winthrop university campus after more than an inch of rain fall in less than an hour . one vehicle be stall in the water .
    14 tokens: nws columbia relay a report of tree blow down along tom hall st .
    14 tokens: medium report two tree blow down along i-40 in the old fort area .
     0 tokens:
    15 tokens: a few tree limb great than 6 inch down on hwy 18 in roseland .
    20 tokens: awning blow off a building on lamar avenue . multiple tree down near the intersection of winchester and perkins .
     6 tokens: quarter size hail near rosemark .
    21 tokens: tin roof rip off house on old memphis road near billings drive . several large tree down in the area .
    10 tokens: powerlines down at walnut grove and cherry lane road .
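
To see what addPartOfSpeechDetails contributes, it can help to inspect the token details table. The sketch below uses a single sentence taken from the second narrative. The exact tags and lemmas depend on the toolbox release, so no specific output is assumed here.

```matlab
% One sentence from the second narrative in the example data.
str = "One vehicle was stalled in the water.";
documents = tokenizedDocument(str);
documents = addPartOfSpeechDetails(documents);

% Table with one row per token, including a PartOfSpeech column.
tdetails = tokenDetails(documents);
head(tdetails)

% Lemmatize using the part-of-speech information added above.
documents = normalizeWords(documents,'Style','lemma')
```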

Erase the punctuation from the documents.

cleanedDocuments = erasePunctuation(cleanedDocuments);
cleanedDocuments(1:10)

ans = 
  10×1 tokenizedDocument:

     7 tokens: large tree down between plantersville and nettleton
    37 tokens: one to two foot of deep standing water develop on a street on the winthrop university campus after more than an inch of rain fall in less than an hour one vehicle be stall in the water
    13 tokens: nws columbia relay a report of tree blow down along tom hall st
    13 tokens: medium report two tree blow down along i40 in the old fort area
     0 tokens:
    14 tokens: a few tree limb great than 6 inch down on hwy 18 in roseland
    18 tokens: awning blow off a building on lamar avenue multiple tree down near the intersection of winchester and perkins
     5 tokens: quarter size hail near rosemark
    19 tokens: tin roof rip off house on old memphis road near billings drive several large tree down in the area
     9 tokens: powerlines down at walnut grove and cherry lane road
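
The output above shows two effects: tokens that consist only of punctuation disappear, and punctuation inside a token is erased (for example, "i-40" becomes "i40"). A minimal sketch that checks both on the fourth narrative, without the lemmatization step:

```matlab
% Fourth narrative from the example data.
str = "Media reported two trees blown down along I-40 in the Old Fort area.";
documents = tokenizedDocument(str);
numBefore = doclength(documents);

documents = erasePunctuation(documents);
numAfter = doclength(documents);

% The trailing period is dropped and the hyphen is erased from "I-40".
tokens = string(documents)
```

If hyphenated identifiers such as road numbers matter for your analysis, consider whether erasing punctuation is appropriate for your data.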

Words like "a", "and", and "to" (known as stop words) can add noise to data. Remove a list of stop words using the removeStopWords function.

cleanedDocuments = removeStopWords(cleanedDocuments);
cleanedDocuments(1:10)

ans = 
  10×1 tokenizedDocument:

     5 tokens: large tree down plantersville nettleton
    18 tokens: two foot deep standing water develop street winthrop university campus inch rain fall less hour vehicle stall water
    10 tokens: nws columbia relay report tree blow down tom hall st
    10 tokens: medium report two tree blow down i40 old fort area
     0 tokens:
    10 tokens: few tree limb great 6 inch down hwy 18 roseland
    13 tokens: awning blow off building lamar avenue multiple tree down near intersection winchester perkins
     5 tokens: quarter size hail near rosemark
    16 tokens: tin roof rip off house old memphis road near billings drive several large tree down area
     7 tokens: powerlines down walnut grove cherry lane road
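
If the default list is not enough, one possible approach is to extend it with your own words and pass the combined list to removeWords. In this sketch, the words in customWords are hypothetical additions chosen for illustration, not part of the original example.

```matlab
% Default English stop word list.
words = stopWords;

% Hypothetical domain-specific additions (assumed for illustration).
customWords = ["nws" "hwy"];

% Third document as it appears after erasePunctuation.
str = "nws columbia relay a report of tree blow down along tom hall st";
documents = tokenizedDocument(str);

% Remove the default stop words and the custom words in one pass.
documents = removeWords(documents,[words(:); customWords(:)])
```

Note that removeWords is case sensitive by default, so this sketch relies on the text already being lowercase.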

Remove words with 2 or fewer characters, and words with 15 or more characters.

cleanedDocuments = removeShortWords(cleanedDocuments,2);
cleanedDocuments = removeLongWords(cleanedDocuments,15);
cleanedDocuments(1:10)

ans = 
  10×1 tokenizedDocument:

     5 tokens: large tree down plantersville nettleton
    18 tokens: two foot deep standing water develop street winthrop university campus inch rain fall less hour vehicle stall water
     9 tokens: nws columbia relay report tree blow down tom hall
    10 tokens: medium report two tree blow down i40 old fort area
     0 tokens:
     8 tokens: few tree limb great inch down hwy roseland
    13 tokens: awning blow off building lamar avenue multiple tree down near intersection winchester perkins
     5 tokens: quarter size hail near rosemark
    16 tokens: tin roof rip off house old memphis road near billings drive several large tree down area
     7 tokens: powerlines down walnut grove cherry lane road
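
Both thresholds are inclusive: a word of exactly 2 characters is removed, and so is a word of exactly 15 characters. The following sketch reproduces the same rule with base MATLAB string functions on a hand-picked word list (the list is an assumption for illustration).

```matlab
% Hand-picked words: two short tokens from the output above, two ordinary
% words, and one 15-character word.
words = ["st" "18" "hwy" "tree" "thunderstorms" "extraordinarily"];

% Keep words with more than 2 and fewer than 15 characters.
len = strlength(words);
kept = words(len > 2 & len < 15)
```

Here kept contains "hwy", "tree", and "thunderstorms". This explains why "st", "6", and "18" disappear from the documents above.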

Create Bag-of-Words Model

Create a bag-of-words model.

cleanedBag = bagOfWords(cleanedDocuments)

cleanedBag = 
  bagOfWords with properties:

          Counts: [36176×18469 double]
      Vocabulary: [1×18469 string]
        NumWords: 18469
    NumDocuments: 36176
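
To get a feel for what the model holds, the sketch below builds a bag-of-words model from two tiny made-up documents and lists the most frequent words with topkwords. The toy documents are assumptions for illustration.

```matlab
% Two toy documents (assumed, already preprocessed).
documents = tokenizedDocument(["large tree down" "tree limb down road"]);
bag = bagOfWords(documents);

% Counts has one row per document and one column per vocabulary word.
size(bag.Counts)

% Most frequent words across all documents.
tbl = topkwords(bag,2)
```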

Remove words that do not appear more than two times in the bag-of-words model.

cleanedBag = removeInfrequentWords(cleanedBag,2)

cleanedBag = 
  bagOfWords with properties:

          Counts: [36176×6974 double]
      Vocabulary: [1×6974 string]
        NumWords: 6974
    NumDocuments: 36176
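
The count argument is an inclusive upper bound: words that appear that many times or fewer are removed. A small sketch with toy documents (assumed for illustration) and a threshold of 1:

```matlab
% Toy documents: "tree" appears 3 times, "down" twice, "road" and "limb" once.
documents = tokenizedDocument(["tree down" "tree down road" "tree limb"]);
bag = bagOfWords(documents);

% Remove words that appear at most once in total.
bag = removeInfrequentWords(bag,1);
vocabulary = sort(bag.Vocabulary)
```

Only "down" and "tree" remain in the vocabulary. The number of documents is unchanged, which is why empty documents can be left behind.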

Some preprocessing steps, such as removeInfrequentWords, leave empty documents in the bag-of-words model. To ensure that no empty documents remain in the bag-of-words model after preprocessing, use removeEmptyDocuments as the last step.

Remove empty documents from the bag-of-words model and the corresponding labels from labels.

[cleanedBag,idx] = removeEmptyDocuments(cleanedBag);
labels(idx) = [];
cleanedBag

cleanedBag = 
  bagOfWords with properties:

          Counts: [28137×6974 double]
      Vocabulary: [1×6974 string]
        NumWords: 6974
    NumDocuments: 28137
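
The second output of removeEmptyDocuments is what keeps the labels aligned with the remaining documents. A minimal sketch with three toy documents and hypothetical labels (both assumed for illustration):

```matlab
% Toy data: the second document is empty.
documents = tokenizedDocument(["tree down" "" "hail near road"]);
toyLabels = ["Thunderstorm Wind" "Flood" "Hail"];

bag = bagOfWords(documents);

% idx holds the indices of the removed documents.
[bag,idx] = removeEmptyDocuments(bag);

% Delete the labels of the removed documents so that rows still match.
toyLabels(idx) = []
```

Skipping the second step would leave more labels than documents, which typically causes errors or silent misalignment when training a classifier.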

Create a Preprocessing Function

It can be useful to create a function that performs preprocessing so that you can prepare different collections of text data in the same way. For example, you can use a function to preprocess new data using the same steps as the training data.

Create a function that tokenizes and preprocesses text data so that it can be used for analysis. The function preprocessWeatherNarratives performs the following steps:

Tokenize the text using tokenizedDocument.
Lemmatize the words using normalizeWords, after adding part-of-speech details using addPartOfSpeechDetails.
Erase punctuation using erasePunctuation.
Remove a list of stop words (such as "and") using removeStopWords.
Remove words with 2 or fewer characters using removeShortWords.
Remove words with 15 or more characters using removeLongWords.

Use the example preprocessing function preprocessWeatherNarratives to prepare the text data.

newText = "A tree is downed outside Apple Hill Drive, Natick";
newDocuments = preprocessWeatherNarratives(newText)

newDocuments = 
  tokenizedDocument:

   7 tokens: tree down outside apple hill drive natick

Compare with Raw Data

Compare the preprocessed data with the raw data.

rawDocuments = tokenizedDocument(textData);
rawBag = bagOfWords(rawDocuments)

rawBag = 
  bagOfWords with properties:

          Counts: [36176×23302 double]
      Vocabulary: [1×23302 string]
        NumWords: 23302
    NumDocuments: 36176

Calculate the reduction in data.

numWordsCleaned = cleanedBag.NumWords;
numWordsRaw = rawBag.NumWords;
reduction = 1 - numWordsCleaned/numWordsRaw

reduction = 0.7007
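
The same arithmetic can be checked by hand from the displayed model sizes. The sketch below uses the vocabulary and document counts reported above as literals, and also computes how many documents were removed as empty.

```matlab
% Vocabulary sizes reported above for the cleaned and raw models.
numWordsCleaned = 6974;
numWordsRaw = 23302;
reduction = 1 - numWordsCleaned/numWordsRaw;
fprintf("Vocabulary reduction: %.1f%%\n",100*reduction)

% Document counts before and after removeEmptyDocuments.
numDocsRemoved = 36176 - 28137
```

In other words, preprocessing removes about 70% of the vocabulary, and 8039 documents end up empty and are dropped.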

Compare the raw data and the cleaned data by visualizing the two bag-of-words models using word clouds.

figure
subplot(1,2,1)
wordcloud(rawBag);
title("Raw Data")
subplot(1,2,2)
wordcloud(cleanedBag);
title("Cleaned Data")

Preprocessing Function

The function preprocessWeatherNarratives performs the following steps in order:

Tokenize the text using tokenizedDocument.
Lemmatize the words using normalizeWords, after adding part-of-speech details using addPartOfSpeechDetails.
Erase punctuation using erasePunctuation.
Remove a list of stop words (such as "and") using removeStopWords.
Remove words with 2 or fewer characters using removeShortWords.
Remove words with 15 or more characters using removeLongWords.

function documents = preprocessWeatherNarratives(textData)

% Tokenize the text.
documents = tokenizedDocument(textData);

% Lemmatize the words. To improve lemmatization, first use
% addPartOfSpeechDetails.
documents = addPartOfSpeechDetails(documents);
documents = normalizeWords(documents,'Style','lemma');

% Erase punctuation.
documents = erasePunctuation(documents);

% Remove a list of stop words.
documents = removeStopWords(documents);

% Remove words with 2 or fewer characters, and words with 15 or more
% characters.
documents = removeShortWords(documents,2);
documents = removeLongWords(documents,15);

end
