Extract Text Data from Files

This example shows how to extract the text data from text, HTML, Microsoft® Word, PDF, CSV, and Microsoft Excel® files and import it into MATLAB® for analysis.

Usually, the easiest way to import text data into MATLAB is to use the extractFileText function. This function extracts the text data from text, PDF, HTML, and Microsoft Word files. To import text from CSV and Microsoft Excel files, use readtable. To extract text from HTML code, use extractHTMLText. To read data from PDF forms, use readPDFFormData.

Text File

Extract the text from sonnets.txt using extractFileText. The file sonnets.txt contains Shakespeare's sonnets in plain text.

filename = "sonnets.txt";
str = extractFileText(filename);

View the first sonnet by extracting the text between the two titles "I" and "II".

start = " I" + newline;
fin = " II";
sonnet1 = extractBetween(str,start,fin)
sonnet1 = 
    "
       From fairest creatures we desire increase,
       That thereby beauty's rose might never die,
       But as the riper should by time decease,
       His tender heir might bear his memory:
       But thou, contracted to thine own bright eyes,
       Feed'st thy light's flame with self-substantial fuel,
       Making a famine where abundance lies,
       Thy self thy foe, to thy sweet self too cruel:
       Thou that art now the world's fresh ornament,
       And only herald to the gaudy spring,
       Within thine own bud buriest thy content,
       And tender churl mak'st waste in niggarding:
         Pity the world, or else this glutton be,
         To eat the world's due, by the grave and thee.
     
      "

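The same pattern applies to any text with section headings. The following is a minimal sketch on a made-up string (the sample text and the variable names startDelim, endDelim, and section are illustrative assumptions, not part of the example files). It uses only base MATLAB string functions.

```matlab
% Hypothetical sample text that mimics the heading layout of sonnets.txt.
str = " I" + newline + "first poem" + newline + " II" + newline + "second poem";

% Include the newline in the start delimiter so that " I" cannot
% match the beginning of the heading " II".
startDelim = " I" + newline;
endDelim = " II";

% Extract the text between the two headings.
section = extractBetween(str, startDelim, endDelim);

% Remove the leading and trailing whitespace, including newlines.
section = strtrim(section);
```

Appending newline to the start delimiter is what keeps the shorter heading from matching inside the longer one.
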
Microsoft Word Document

Extract the text from exampleSonnets.docx using extractFileText. The file exampleSonnets.docx contains Shakespeare's sonnets in a Microsoft Word document.

filename = "exampleSonnets.docx";
str = extractFileText(filename);

View the second sonnet by extracting the text between the two titles "II" and "III".

start = " II" + newline;
fin = " III";
sonnet2 = extractBetween(str,start,fin)
sonnet2 = 
    "
       When forty winters shall besiege thy brow,
     
       And dig deep trenches in thy beauty's field,
     
       Thy youth's proud livery so gazed on now,
     
       Will be a tatter'd weed of small worth held:
     
       Then being asked, where all thy beauty lies,
     
       Where all the treasure of thy lusty days;
     
       To say, within thine own deep sunken eyes,
     
       Were an all-eating shame, and thriftless praise.
     
       How much more praise deserv'd thy beauty's use,
     
       If thou couldst answer 'This fair child of mine
     
       Shall sum my count, and make my old excuse,'
     
       Proving his beauty by succession thine!
     
         This were to be new made when thou art old,
     
         And see thy blood warm when thou feel'st it cold.
     
      "

The example Microsoft Word document uses two newline characters between each line. To replace these characters with a single newline character, use the replace function.

sonnet2 = replace(sonnet2,[newline newline],newline)
sonnet2 = 
    "
       When forty winters shall besiege thy brow,
       And dig deep trenches in thy beauty's field,
       Thy youth's proud livery so gazed on now,
       Will be a tatter'd weed of small worth held:
       Then being asked, where all thy beauty lies,
       Where all the treasure of thy lusty days;
       To say, within thine own deep sunken eyes,
       Were an all-eating shame, and thriftless praise.
       How much more praise deserv'd thy beauty's use,
       If thou couldst answer 'This fair child of mine
       Shall sum my count, and make my old excuse,'
       Proving his beauty by succession thine!
         This were to be new made when thou art old,
         And see thy blood warm when thou feel'st it cold.
      "

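As a self-contained illustration of the same cleanup, the sketch below collapses doubled newlines in a made-up string and then splits the result into lines. The sample text and the names doubleNewline and textLines are assumptions for illustration only.

```matlab
% Hypothetical text with doubled newlines, as in the Word example.
txt = "line one" + newline + newline + "line two" + newline + newline + "line three";

% Build a two-character string. Adding two char newlines directly
% would add their character codes instead of concatenating them.
doubleNewline = string(newline) + newline;

% Replace each pair of newlines with a single newline.
txt = replace(txt, doubleNewline, newline);

% Split the cleaned text into one string per line.
textLines = splitlines(txt);
```

The example above uses [newline newline], which concatenates two characters into a character vector; the string form here is an equivalent alternative.
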
PDF Files

Extract text from PDF documents and data from PDF forms.

PDF Document

Extract the text from exampleSonnets.pdf using extractFileText. The file exampleSonnets.pdf contains Shakespeare's sonnets in a PDF.

filename = "exampleSonnets.pdf";
str = extractFileText(filename);

View the third sonnet by extracting the text between the two titles "III" and "IV". This PDF has a space before each newline character.

start = " III " + newline;
fin = "IV";
sonnet3 = extractBetween(str,start,fin)
sonnet3 = 
    " 
       Look in thy glass and tell the face thou viewest 
       Now is the time that face should form another; 
       Whose fresh repair if now thou not renewest, 
       Thou dost beguile the world, unbless some mother. 
       For where is she so fair whose unear'd womb 
       Disdains the tillage of thy husbandry? 
       Or who is he so fond will be the tomb, 
       Of his self-love to stop posterity? 
       Thou art thy mother's glass and she in thee 
       Calls back the lovely April of her prime; 
       So thou through windows of thine age shalt see, 
       Despite of wrinkles this thy golden time. 
         But if thou live, remember'd not to be, 
         Die single and thine image dies with thee. 
     
      
       "

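If the trailing spaces get in the way of later processing, one possible cleanup is sketched below on a made-up string. The sample text and the variable name cleaned are assumptions; the sketch removes only a single space before each newline, which matches the layout described above.

```matlab
% Hypothetical text with a space before each newline, as in the PDF example.
txt = "first line " + newline + "second line " + newline;

% Drop the space that precedes each newline character.
cleaned = replace(txt, " " + newline, newline);

% Remove any remaining leading or trailing whitespace.
cleaned = strtrim(cleaned);
```
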
PDF Form

To read text data from PDF forms, use readPDFFormData. The function returns a struct containing the data from the PDF form fields.

filename = "weatherReportForm1.pdf";
data = readPDFFormData(filename)
data = struct with fields:
         event_type: "Thunderstorm Wind"
    event_narrative: "Large tree down between Plantersville and Nettleton."

HTML

Extract text from HTML files, HTML code, and the web.

HTML File

To extract text data from a saved HTML file, use extractFileText.

filename = "exampleSonnets.html";
str = extractFileText(filename);

View the fourth sonnet by extracting the text between the two titles "IV" and "V".

start = newline + "IV" + newline;
fin = newline + "V" + newline;
sonnet4 = extractBetween(str,start,fin)
sonnet4 = 
    "
     Unthrifty loveliness, why dost thou spend
      Upon thy self thy beauty's legacy?
      Nature's bequest gives nothing, but doth lend,
      And being frank she lends to those are free:
      Then, beauteous niggard, why dost thou abuse
      The bounteous largess given thee to give?
      Profitless usurer, why dost thou use
      So great a sum of sums, yet canst not live?
      For having traffic with thy self alone,
      Thou of thy self thy sweet self dost deceive:
      Then how when nature calls thee to be gone,
      What acceptable audit canst thou leave?
      Thy unused beauty must be tombed with thee,
      Which, used, lives th' executor to be.
     "

HTML Code

To extract text data from a string containing HTML code, use extractHTMLText.

code = "<html><body><h1>THE SONNETS</h1><p>by William Shakespeare</p></body></html>";
str = extractHTMLText(code)
str = 
    "THE SONNETS
     
     by William Shakespeare"

From the Web

To extract text data from a web page, first read the HTML code using webread, and then use extractHTMLText.

url = "https://www.mathworks.com/help/textanalytics";
code = webread(url);
str = extractHTMLText(code)
str = 
    'Text Analytics Toolbox™ provides algorithms and visualizations for preprocessing, analyzing, and modeling text data. Models created with the toolbox can be used in applications such as sentiment analysis, predictive maintenance, and topic modeling.
     
     Text Analytics Toolbox includes tools for processing raw text from sources such as equipment logs, news feeds, surveys, operator reports, and social media. You can extract text from popular file formats, preprocess raw text, extract individual words, convert text into numerical representations, and build statistical models.
     
     Using machine learning techniques such as LSA, LDA, and word embeddings, you can find clusters and create features from high-dimensional text datasets. Features created with Text Analytics Toolbox can be combined with features from other data sources to build machine learning models that take advantage of textual, numeric, and other types of data.'

Parse HTML Code

To find particular elements of HTML code, parse the code using htmlTree and use findElement. Parse the HTML code and find all the hyperlinks. The hyperlinks are nodes with element name "A".

tree = htmlTree(code);
selector = "A";
subtrees = findElement(tree,selector);

View the first 10 subtrees and extract the text using extractHTMLText.

subtrees(1:10)
ans = 
  10×1 htmlTree:

    <A class="svg_link navbar-brand" href="https://www.mathworks.com?s_tid=gn_logo"><IMG alt="MathWorks" class="mw_logo" src="/images/responsive/global/pic-header-mathworks-logo.svg"/></A>
    <A class="mwa-nav_login" href="https://www.mathworks.com/login?uri=http://www.mathworks.com/help/textanalytics/index.html">Sign In</A>
    <A href="https://www.mathworks.com/products.html?s_tid=gn_ps">Products</A>
    <A href="https://www.mathworks.com/solutions.html?s_tid=gn_sol">Solutions</A>
    <A href="https://www.mathworks.com/academia.html?s_tid=gn_acad">Academia</A>
    <A href="https://www.mathworks.com/support.html?s_tid=gn_supp">Support</A>
    <A href="https://www.mathworks.com/matlabcentral/?s_tid=gn_mlc">Community</A>
    <A href="https://www.mathworks.com/company/events.html?s_tid=gn_ev">Events</A>
    <A href="https://www.mathworks.com/company/aboutus/contact_us.html?s_tid=gn_cntus">Contact Us</A>
    <A href="https://www.mathworks.com/store?s_cid=store_top_nav&amp;s_tid=gn_store">How to Buy</A>

str = extractHTMLText(subtrees);

View the extracted text of the first 10 hyperlinks.

str(1:10)
ans = 10×1 string array
    ""
    "Sign In"
    "Products"
    "Solutions"
    "Academia"
    "Support"
    "Community"
    "Events"
    "Contact Us"
    "How to Buy"

To get the link targets, use getAttribute and specify the attribute "href" (hyperlink reference). Get the link targets of the first 10 subtrees.

attr = "href";
str = getAttribute(subtrees(1:10),attr)
str = 10×1 string array
    "https://www.mathworks.com?s_tid=gn_logo"
    "https://www.mathworks.com/login?uri=http://www.mathworks.com/help/textanalytics/index.html"
    "https://www.mathworks.com/products.html?s_tid=gn_ps"
    "https://www.mathworks.com/solutions.html?s_tid=gn_sol"
    "https://www.mathworks.com/academia.html?s_tid=gn_acad"
    "https://www.mathworks.com/support.html?s_tid=gn_supp"
    "https://www.mathworks.com/matlabcentral/?s_tid=gn_mlc"
    "https://www.mathworks.com/company/events.html?s_tid=gn_ev"
    "https://www.mathworks.com/company/aboutus/contact_us.html?s_tid=gn_cntus"
    "https://www.mathworks.com/store?s_cid=store_top_nav&s_tid=gn_store"

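Because the link targets are returned as a string array, they can be filtered with base MATLAB string functions. The sketch below uses three of the URLs from the output above as a literal array so that it runs without a web request; the "/company/" filter and the variable names links, companyLinks, and pages are illustrative assumptions.

```matlab
% A few link targets copied from the output above.
links = [
    "https://www.mathworks.com/products.html?s_tid=gn_ps"
    "https://www.mathworks.com/company/events.html?s_tid=gn_ev"
    "https://www.mathworks.com/company/aboutus/contact_us.html?s_tid=gn_cntus"];

% Keep only the links that point into the /company/ section.
isCompany = contains(links, "/company/");
companyLinks = links(isCompany);

% Drop the query string by keeping the text before the first "?".
pages = extractBefore(companyLinks, "?");
```

Note that extractBefore expects every element to contain the delimiter, which holds for the links kept here.
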
CSV and Microsoft Excel Files

To extract text data from CSV and Microsoft Excel files, use readtable and extract the text data from the table that it returns.

Extract the table data using the readtable function and view the first few rows of the table.

T = readtable('weatherReports.csv','TextType','string');
head(T)
ans=8×16 table
            Time             event_id          state              event_type         damage_property    damage_crops    begin_lat    begin_lon    end_lat    end_lon                                                                                             event_narrative                                                                                             storm_duration    begin_day    end_day    year       end_timestamp    
    ____________________    __________    ________________    ___________________    _______________    ____________    _________    _________    _______    _______    _________________________________________________________________________________________________________________________________________________________________________________________________    ______________    _________    _______    ____    ____________________

    22-Jul-2016 16:10:00    6.4433e+05    "MISSISSIPPI"       "Thunderstorm Wind"       ""                "0.00K"         34.14        -88.63     34.122     -88.626    "Large tree down between Plantersville and Nettleton."                                                                                                                                                  00:05:00          22          22       2016    22-Jul-0016 16:15:00
    15-Jul-2016 17:15:00    6.5182e+05    "SOUTH CAROLINA"    "Heavy Rain"              "2.00K"           "0.00K"         34.94        -81.03      34.94      -81.03    "One to two feet of deep standing water developed on a street on the Winthrop University campus after more than an inch of rain fell in less than an hour. One vehicle was stalled in the water."       00:00:00          15          15       2016    15-Jul-0016 17:15:00
    15-Jul-2016 17:25:00    6.5183e+05    "SOUTH CAROLINA"    "Thunderstorm Wind"       "0.00K"           "0.00K"         35.01        -80.93      35.01      -80.93    "NWS Columbia relayed a report of trees blown down along Tom Hall St."                                                                                                                                  00:00:00          15          15       2016    15-Jul-0016 17:25:00
    16-Jul-2016 12:46:00    6.5183e+05    "NORTH CAROLINA"    "Thunderstorm Wind"       "0.00K"           "0.00K"         35.64        -82.14      35.64      -82.14    "Media reported two trees blown down along I-40 in the Old Fort area."                                                                                                                                  00:00:00          16          16       2016    16-Jul-0016 12:46:00
    15-Jul-2016 14:28:00    6.4332e+05    "MISSOURI"          "Hail"                    ""                ""              36.45        -89.97      36.45      -89.97    ""                                                                                                                                                                                                      00:07:00          15          15       2016    15-Jul-0016 14:35:00
    15-Jul-2016 16:31:00    6.4332e+05    "ARKANSAS"          "Thunderstorm Wind"       ""                "0.00K"         35.85         -90.1     35.838     -90.087    "A few tree limbs greater than 6 inches down on HWY 18 in Roseland."                                                                                                                                    00:09:00          15          15       2016    15-Jul-0016 16:40:00
    15-Jul-2016 16:03:00    6.4343e+05    "TENNESSEE"         "Thunderstorm Wind"       "20.00K"          "0.00K"        35.056       -89.937      35.05     -89.904    "Awning blown off a building on Lamar Avenue. Multiple trees down near the intersection of Winchester and Perkins."                                                                                     00:07:00          15          15       2016    15-Jul-0016 16:10:00
    15-Jul-2016 17:27:00    6.4344e+05    "TENNESSEE"         "Hail"                    ""                ""             35.385        -89.78     35.385      -89.78    "Quarter size hail near Rosemark."                                                                                                                                                                      00:05:00          15          15       2016    15-Jul-0016 17:32:00

Extract the text data from the event_narrative column and view the first few rows.

str = T.event_narrative;
str(1:10)
ans = 10×1 string array
    "Large tree down between Plantersville and Nettleton."
    "One to two feet of deep standing water developed on a street on the Winthrop University campus after more than an inch of rain fell in less than an hour. One vehicle was stalled in the water."
    "NWS Columbia relayed a report of trees blown down along Tom Hall St."
    "Media reported two trees blown down along I-40 in the Old Fort area."
    ""
    "A few tree limbs greater than 6 inches down on HWY 18 in Roseland."
    "Awning blown off a building on Lamar Avenue. Multiple trees down near the intersection of Winchester and Perkins."
    "Quarter size hail near Rosemark."
    "Tin roof ripped off house on Old Memphis Road near Billings Drive. Several large trees down in the area."
    "Powerlines down at Walnut Grove and Cherry Lane roads."

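Some of the narratives in the output above are empty strings. One possible way to drop such rows before analysis is sketched below on a small hand-built table that reuses three rows from the output above; the variable name hasText is an illustrative assumption.

```matlab
% Small hand-built table with two of the columns from weatherReports.csv.
T = table( ...
    ["Thunderstorm Wind"; "Hail"; "Hail"], ...
    ["Large tree down between Plantersville and Nettleton."; ""; "Quarter size hail near Rosemark."], ...
    'VariableNames', {'event_type', 'event_narrative'});

% Find the rows whose narrative contains at least one character.
hasText = strlength(T.event_narrative) > 0;

% Keep only those rows and extract the narratives.
T = T(hasText, :);
str = T.event_narrative;
```
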
Extract Text from Multiple Files

If your text data is contained in multiple files in a folder, then you can import the text data into MATLAB using a file datastore.

Create a file datastore for the example sonnet text files. The example files are named "exampleSonnetN.txt", where N is the number of the sonnet. Specify the file name using the wildcard "*" to find all file names of this structure. To specify extractFileText as the read function, pass it to fileDatastore as a function handle.

fds = fileDatastore('exampleSonnet*.txt','ReadFcn',@extractFileText)
fds = 
  FileDatastore with properties:

                       Files: {
                              ' ...\Documents\MATLAB\examples\textanalytics-ex15735454\exampleSonnet1.txt';
                              ' ...\Documents\MATLAB\examples\textanalytics-ex15735454\exampleSonnet2.txt';
                              ' ...\Documents\MATLAB\examples\textanalytics-ex15735454\exampleSonnet3.txt'
                               ... and 1 more
                              }
                 UniformRead: 0
                     ReadFcn: @extractFileText
    AlternateFileSystemRoots: {}

Loop over the files in the datastore and read each text file.

str = [];
while hasdata(fds)
    textData = read(fds);
    str = [str; textData];
end
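
The loop above grows str on every iteration. As an alternative, the sketch below preallocates the string array from the number of files in the datastore. To stay self-contained it writes three temporary files with hypothetical names (demoSonnetN.txt) and, as an assumption made only so that it runs without additional toolboxes, reads them with fileread instead of extractFileText.

```matlab
% Create a temporary folder with three small, hypothetical text files.
folder = tempname;
mkdir(folder);
for k = 1:3
    fid = fopen(fullfile(folder, sprintf('demoSonnet%d.txt', k)), 'w');
    fprintf(fid, 'Text of demo file %d', k);
    fclose(fid);
end

% Create the datastore. fileread stands in for extractFileText here.
fds = fileDatastore(fullfile(folder, 'demoSonnet*.txt'), 'ReadFcn', @fileread);

% Preallocate one string per file, then read each file in turn.
n = numel(fds.Files);
str = strings(n, 1);
for k = 1:n
    str(k) = read(fds);
end

% Remove the temporary folder and its files.
rmdir(folder, 's');
```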

View the extracted text.

str
str = 4×1 string array
    "  From fairest creatures we desire increase,↵  That thereby beauty's rose might never die,↵  But as the riper should by time decease,↵  His tender heir might bear his memory:↵  But thou, contracted to thine own bright eyes,↵  Feed'st thy light's flame with self-substantial fuel,↵  Making a famine where abundance lies,↵  Thy self thy foe, to thy sweet self too cruel:↵  Thou that art now the world's fresh ornament,↵  And only herald to the gaudy spring,↵  Within thine own bud buriest thy content,↵  And tender churl mak'st waste in niggarding:↵    Pity the world, or else this glutton be,↵    To eat the world's due, by the grave and thee."
    "  When forty winters shall besiege thy brow,↵  And dig deep trenches in thy beauty's field,↵  Thy youth's proud livery so gazed on now,↵  Will be a tatter'd weed of small worth held:↵  Then being asked, where all thy beauty lies,↵  Where all the treasure of thy lusty days;↵  To say, within thine own deep sunken eyes,↵  Were an all-eating shame, and thriftless praise.↵  How much more praise deserv'd thy beauty's use,↵  If thou couldst answer 'This fair child of mine↵  Shall sum my count, and make my old excuse,'↵  Proving his beauty by succession thine!↵    This were to be new made when thou art old,↵    And see thy blood warm when thou feel'st it cold."
    "  Look in thy glass and tell the face thou viewest↵  Now is the time that face should form another;↵  Whose fresh repair if now thou not renewest,↵  Thou dost beguile the world, unbless some mother.↵  For where is she so fair whose unear'd womb↵  Disdains the tillage of thy husbandry?↵  Or who is he so fond will be the tomb,↵  Of his self-love to stop posterity?↵  Thou art thy mother's glass and she in thee↵  Calls back the lovely April of her prime;↵  So thou through windows of thine age shalt see,↵  Despite of wrinkles this thy golden time.↵    But if thou live, remember'd not to be,↵    Die single and thine image dies with thee."
    "  Unthrifty loveliness, why dost thou spend↵  Upon thy self thy beauty's legacy?↵  Nature's bequest gives nothing, but doth lend,↵  And being frank she lends to those are free:↵  Then, beauteous niggard, why dost thou abuse↵  The bounteous largess given thee to give?↵  Profitless usurer, why dost thou use↵  So great a sum of sums, yet canst not live?↵  For having traffic with thy self alone,↵  Thou of thy self thy sweet self dost deceive:↵  Then how when nature calls thee to be gone,↵  What acceptable audit canst thou leave?↵    Thy unused beauty must be tombed with thee,↵    Which, used, lives th' executor to be."
