This example shows how to extract the text data from text, HTML, Microsoft® Word, PDF, CSV, and Microsoft Excel® files and import it into MATLAB® for analysis.
Usually, the easiest way to import text data into MATLAB is to use the extractFileText function. This function extracts the text data from text, PDF, HTML, and Microsoft Word files. To import text from CSV and Microsoft Excel files, use readtable. To extract text from HTML code, use extractHTMLText. To read data from PDF forms, use readPDFFormData.
Extract the text from sonnets.txt using extractFileText. The file sonnets.txt contains Shakespeare's sonnets in plain text.
filename = "sonnets.txt";
str = extractFileText(filename);
View the first sonnet by extracting the text between the two titles "I" and "II".
start = " I" + newline;
fin = " II";
sonnet1 = extractBetween(str,start,fin)
sonnet1 =
    " From fairest creatures we desire increase,
      That thereby beauty's rose might never die,
      But as the riper should by time decease,
      His tender heir might bear his memory:
      But thou, contracted to thine own bright eyes,
      Feed'st thy light's flame with self-substantial fuel,
      Making a famine where abundance lies,
      Thy self thy foe, to thy sweet self too cruel:
      Thou that art now the world's fresh ornament,
      And only herald to the gaudy spring,
      Within thine own bud buriest thy content,
      And tender churl mak'st waste in niggarding:
      Pity the world, or else this glutton be,
      To eat the world's due, by the grave and thee.
     "
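The title-based extraction above can be tried on a small in-memory string. This is a minimal sketch, assuming a hypothetical two-section document whose titles follow the same layout as sonnets.txt; the section bodies are placeholders, not text from the file.

```matlab
% Hypothetical miniature document: two titled sections separated by newlines.
doc = " I" + newline + "first body" + newline + " II" + newline + "second body";

% Append newline to the start title so that the match ends at the end of
% the title line and the extracted text begins with the section body.
start = " I" + newline;
fin = " II";

% Extract the text between the two titles.
body = extractBetween(doc,start,fin);

% Remove the leading and trailing whitespace left over from the layout.
body = strip(body)
```

Calling strip afterwards is optional; it only trims the whitespace that surrounds the extracted section.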
Extract the text from exampleSonnets.docx using extractFileText. The file exampleSonnets.docx contains Shakespeare's sonnets in a Microsoft Word document.
filename = "exampleSonnets.docx";
str = extractFileText(filename);
View the second sonnet by extracting the text between the two titles "II" and "III".
start = " II" + newline;
fin = " III";
sonnet2 = extractBetween(str,start,fin)
sonnet2 =
    "
      When forty winters shall besiege thy brow,

      And dig deep trenches in thy beauty's field,

      Thy youth's proud livery so gazed on now,

      Will be a tatter'd weed of small worth held:

      Then being asked, where all thy beauty lies,

      Where all the treasure of thy lusty days;

      To say, within thine own deep sunken eyes,

      Were an all-eating shame, and thriftless praise.

      How much more praise deserv'd thy beauty's use,

      If thou couldst answer 'This fair child of mine

      Shall sum my count, and make my old excuse,'

      Proving his beauty by succession thine!

      This were to be new made when thou art old,

      And see thy blood warm when thou feel'st it cold.

     "
The example Microsoft Word document uses two newline characters between each line. To replace these characters with a single newline character, use the replace function.
sonnet2 = replace(sonnet2,[newline newline],newline)
sonnet2 =
    " When forty winters shall besiege thy brow,
      And dig deep trenches in thy beauty's field,
      Thy youth's proud livery so gazed on now,
      Will be a tatter'd weed of small worth held:
      Then being asked, where all thy beauty lies,
      Where all the treasure of thy lusty days;
      To say, within thine own deep sunken eyes,
      Were an all-eating shame, and thriftless praise.
      How much more praise deserv'd thy beauty's use,
      If thou couldst answer 'This fair child of mine
      Shall sum my count, and make my old excuse,'
      Proving his beauty by succession thine!
      This were to be new made when thou art old,
      And see thy blood warm when thou feel'st it cold.
     "
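The effect of the replace call is easier to see on a short string. The following is a minimal sketch, assuming a hypothetical three-line text with doubled newline characters; the variable names and the text are illustrative only.

```matlab
% Hypothetical text in which every line break is a pair of newline characters.
txt = "line one" + newline + newline + "line two" + newline + newline + "line three";

% Replace each pair of newline characters with a single newline character.
txt = replace(txt,[newline newline],newline);

% Split on the remaining newline characters to inspect the individual lines.
textLines = splitlines(txt)
```

Because replace substitutes non-overlapping matches in a single pass, a run of three or more newline characters is shortened but not reduced to one by a single call.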
Extract text from PDF documents and data from PDF forms.
Extract the text from exampleSonnets.pdf using extractFileText. The file exampleSonnets.pdf contains Shakespeare's sonnets in a PDF.
filename = "exampleSonnets.pdf";
str = extractFileText(filename);
View the third sonnet by extracting the text between the two titles "III" and "IV". This PDF has a space before each newline character.
start = " III " + newline;
fin = "IV";
sonnet3 = extractBetween(str,start,fin)
sonnet3 =
    " Look in thy glass and tell the face thou viewest
      Now is the time that face should form another;
      Whose fresh repair if now thou not renewest,
      Thou dost beguile the world, unbless some mother.
      For where is she so fair whose unear'd womb
      Disdains the tillage of thy husbandry?
      Or who is he so fond will be the tomb,
      Of his self-love to stop posterity?
      Thou art thy mother's glass and she in thee
      Calls back the lovely April of her prime;
      So thou through windows of thine age shalt see,
      Despite of wrinkles this thy golden time.
      But if thou live, remember'd not to be,
      Die single and thine image dies with thee.
     "
To read text data from PDF forms, use readPDFFormData. The function returns a struct containing the data from the PDF form fields.
filename = "weatherReportForm1.pdf";
data = readPDFFormData(filename)
data = struct with fields:
event_type: "Thunderstorm Wind"
event_narrative: "Large tree down between Plantersville and Nettleton."
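Once the form data is in a struct, the field values can be read by name or gathered into a string array. This is a minimal sketch that builds a hypothetical struct by hand to mirror the output shown above, so it runs without the PDF file; it assumes every field holds text that converts to a string.

```matlab
% Hypothetical struct mirroring the readPDFFormData output shown above.
data = struct( ...
    'event_type',"Thunderstorm Wind", ...
    'event_narrative',"Large tree down between Plantersville and Nettleton.");

% Access a single field by name.
narrative = data.event_narrative;

% Collect every field value into a string array using dynamic field names.
names = fieldnames(data);
values = strings(numel(names),1);
for k = 1:numel(names)
    values(k) = string(data.(names{k}));
end
values
```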
Extract text from HTML files, HTML code, and the web.
To extract text data from a saved HTML file, use extractFileText.
filename = "exampleSonnets.html";
str = extractFileText(filename);
View the fourth sonnet by extracting the text between the two titles "IV" and "V".
start = newline + "IV" + newline;
fin = newline + "V" + newline;
sonnet4 = extractBetween(str,start,fin)
sonnet4 =
    " Unthrifty loveliness, why dost thou spend
      Upon thy self thy beauty's legacy?
      Nature's bequest gives nothing, but doth lend,
      And being frank she lends to those are free:
      Then, beauteous niggard, why dost thou abuse
      The bounteous largess given thee to give?
      Profitless usurer, why dost thou use
      So great a sum of sums, yet canst not live?
      For having traffic with thy self alone,
      Thou of thy self thy sweet self dost deceive:
      Then how when nature calls thee to be gone,
      What acceptable audit canst thou leave?
      Thy unused beauty must be tombed with thee,
      Which, used, lives th' executor to be.
     "
To extract text data from a string containing HTML code, use extractHTMLText.
code = "<html><body><h1>THE SONNETS</h1><p>by William Shakespeare</p></body></html>";
str = extractHTMLText(code)
str = "THE SONNETS by William Shakespeare"
To extract text data from a web page, first read the HTML code using webread, and then use extractHTMLText.
url = "https://www.mathworks.com/help/textanalytics";
code = webread(url);
str = extractHTMLText(code)
str = 'Text Analytics Toolbox™ provides algorithms and visualizations for preprocessing, analyzing, and modeling text data. Models created with the toolbox can be used in applications such as sentiment analysis, predictive maintenance, and topic modeling. Text Analytics Toolbox includes tools for processing raw text from sources such as equipment logs, news feeds, surveys, operator reports, and social media. You can extract text from popular file formats, preprocess raw text, extract individual words, convert text into numerical representations, and build statistical models. Using machine learning techniques such as LSA, LDA, and word embeddings, you can find clusters and create features from high-dimensional text datasets. Features created with Text Analytics Toolbox can be combined with features from other data sources to build machine learning models that take advantage of textual, numeric, and other types of data.'
To find specific HTML elements, parse the code using htmlTree and use findElement. Parse the HTML code and find all the hyperlinks. The hyperlinks are nodes with element name "A".
tree = htmlTree(code);
selector = "A";
subtrees = findElement(tree,selector);
View the first 10 subtrees and extract the text using extractHTMLText.
subtrees(1:10)
ans =
  10×1 htmlTree:

   <A class="svg_link navbar-brand" href="https://www.mathworks.com?s_tid=gn_logo"><IMG alt="MathWorks" class="mw_logo" src="/images/responsive/global/pic-header-mathworks-logo.svg"/></A>
   <A class="mwa-nav_login" href="https://www.mathworks.com/login?uri=http://www.mathworks.com/help/textanalytics/index.html">Sign In</A>
   <A href="https://www.mathworks.com/products.html?s_tid=gn_ps">Products</A>
   <A href="https://www.mathworks.com/solutions.html?s_tid=gn_sol">Solutions</A>
   <A href="https://www.mathworks.com/academia.html?s_tid=gn_acad">Academia</A>
   <A href="https://www.mathworks.com/support.html?s_tid=gn_supp">Support</A>
   <A href="https://www.mathworks.com/matlabcentral/?s_tid=gn_mlc">Community</A>
   <A href="https://www.mathworks.com/company/events.html?s_tid=gn_ev">Events</A>
   <A href="https://www.mathworks.com/company/aboutus/contact_us.html?s_tid=gn_cntus">Contact Us</A>
   <A href="https://www.mathworks.com/store?s_cid=store_top_nav&s_tid=gn_store">How to Buy</A>
str = extractHTMLText(subtrees);
View the extracted text of the first 10 hyperlinks.
str(1:10)
ans = 10×1 string array
""
"Sign In"
"Products"
"Solutions"
"Academia"
"Support"
"Community"
"Events"
"Contact Us"
"How to Buy"
To get the link targets, use getAttribute and specify the attribute "href" (hyperlink reference). Get the link targets of the first 10 subtrees.
attr = "href";
str = getAttribute(subtrees(1:10),attr)
str = 10×1 string array
"https://www.mathworks.com?s_tid=gn_logo"
"https://www.mathworks.com/login?uri=http://www.mathworks.com/help/textanalytics/index.html"
"https://www.mathworks.com/products.html?s_tid=gn_ps"
"https://www.mathworks.com/solutions.html?s_tid=gn_sol"
"https://www.mathworks.com/academia.html?s_tid=gn_acad"
"https://www.mathworks.com/support.html?s_tid=gn_supp"
"https://www.mathworks.com/matlabcentral/?s_tid=gn_mlc"
"https://www.mathworks.com/company/events.html?s_tid=gn_ev"
"https://www.mathworks.com/company/aboutus/contact_us.html?s_tid=gn_cntus"
"https://www.mathworks.com/store?s_cid=store_top_nav&s_tid=gn_store"
To extract text data from CSV and Microsoft Excel files, use readtable and extract the text data from the table that it returns.
Extract the table data using the readtable function and view the first few rows of the table.
T = readtable('weatherReports.csv','TextType','string');
head(T)
ans=8×16 table
Time event_id state event_type damage_property damage_crops begin_lat begin_lon end_lat end_lon event_narrative storm_duration begin_day end_day year end_timestamp
____________________ __________ ________________ ___________________ _______________ ____________ _________ _________ _______ _______ _________________________________________________________________________________________________________________________________________________________________________________________________ ______________ _________ _______ ____ ____________________
22-Jul-2016 16:10:00 6.4433e+05 "MISSISSIPPI" "Thunderstorm Wind" "" "0.00K" 34.14 -88.63 34.122 -88.626 "Large tree down between Plantersville and Nettleton." 00:05:00 22 22 2016 22-Jul-0016 16:15:00
15-Jul-2016 17:15:00 6.5182e+05 "SOUTH CAROLINA" "Heavy Rain" "2.00K" "0.00K" 34.94 -81.03 34.94 -81.03 "One to two feet of deep standing water developed on a street on the Winthrop University campus after more than an inch of rain fell in less than an hour. One vehicle was stalled in the water." 00:00:00 15 15 2016 15-Jul-0016 17:15:00
15-Jul-2016 17:25:00 6.5183e+05 "SOUTH CAROLINA" "Thunderstorm Wind" "0.00K" "0.00K" 35.01 -80.93 35.01 -80.93 "NWS Columbia relayed a report of trees blown down along Tom Hall St." 00:00:00 15 15 2016 15-Jul-0016 17:25:00
16-Jul-2016 12:46:00 6.5183e+05 "NORTH CAROLINA" "Thunderstorm Wind" "0.00K" "0.00K" 35.64 -82.14 35.64 -82.14 "Media reported two trees blown down along I-40 in the Old Fort area." 00:00:00 16 16 2016 16-Jul-0016 12:46:00
15-Jul-2016 14:28:00 6.4332e+05 "MISSOURI" "Hail" "" "" 36.45 -89.97 36.45 -89.97 "" 00:07:00 15 15 2016 15-Jul-0016 14:35:00
15-Jul-2016 16:31:00 6.4332e+05 "ARKANSAS" "Thunderstorm Wind" "" "0.00K" 35.85 -90.1 35.838 -90.087 "A few tree limbs greater than 6 inches down on HWY 18 in Roseland." 00:09:00 15 15 2016 15-Jul-0016 16:40:00
15-Jul-2016 16:03:00 6.4343e+05 "TENNESSEE" "Thunderstorm Wind" "20.00K" "0.00K" 35.056 -89.937 35.05 -89.904 "Awning blown off a building on Lamar Avenue. Multiple trees down near the intersection of Winchester and Perkins." 00:07:00 15 15 2016 15-Jul-0016 16:10:00
15-Jul-2016 17:27:00 6.4344e+05 "TENNESSEE" "Hail" "" "" 35.385 -89.78 35.385 -89.78 "Quarter size hail near Rosemark." 00:05:00 15 15 2016 15-Jul-0016 17:32:00
Extract the text data from the event_narrative column and view the first few strings.
str = T.event_narrative;
str(1:10)
ans = 10×1 string array
"Large tree down between Plantersville and Nettleton."
"One to two feet of deep standing water developed on a street on the Winthrop University campus after more than an inch of rain fell in less than an hour. One vehicle was stalled in the water."
"NWS Columbia relayed a report of trees blown down along Tom Hall St."
"Media reported two trees blown down along I-40 in the Old Fort area."
""
"A few tree limbs greater than 6 inches down on HWY 18 in Roseland."
"Awning blown off a building on Lamar Avenue. Multiple trees down near the intersection of Winchester and Perkins."
"Quarter size hail near Rosemark."
"Tin roof ripped off house on Old Memphis Road near Billings Drive. Several large trees down in the area."
"Powerlines down at Walnut Grove and Cherry Lane roads."
If your text data is contained in multiple files in a folder, then you can import the text data into MATLAB using a file datastore.
Create a file datastore for the example sonnet text files. The example files are named "exampleSonnetN.txt", where N is the number of the sonnet. Specify the file name using the wildcard "*" to find all file names of this structure. To specify the read function to be extractFileText, input this function to fileDatastore using a function handle.
fds = fileDatastore('exampleSonnet*.txt','ReadFcn',@extractFileText)
fds =
  FileDatastore with properties:

                       Files: {
                              ' ...\Documents\MATLAB\examples\textanalytics-ex15735454\exampleSonnet1.txt';
                              ' ...\Documents\MATLAB\examples\textanalytics-ex15735454\exampleSonnet2.txt';
                              ' ...\Documents\MATLAB\examples\textanalytics-ex15735454\exampleSonnet3.txt'
                               ... and 1 more
                              }
                 UniformRead: 0
                     ReadFcn: @extractFileText
    AlternateFileSystemRoots: {}
Loop over the files in the datastore and read each text file.
str = [];
while hasdata(fds)
    textData = read(fds);
    str = [str; textData];
end
View the extracted text.
str
str = 4×1 string array
" From fairest creatures we desire increase,↵ That thereby beauty's rose might never die,↵ But as the riper should by time decease,↵ His tender heir might bear his memory:↵ But thou, contracted to thine own bright eyes,↵ Feed'st thy light's flame with self-substantial fuel,↵ Making a famine where abundance lies,↵ Thy self thy foe, to thy sweet self too cruel:↵ Thou that art now the world's fresh ornament,↵ And only herald to the gaudy spring,↵ Within thine own bud buriest thy content,↵ And tender churl mak'st waste in niggarding:↵ Pity the world, or else this glutton be,↵ To eat the world's due, by the grave and thee."
" When forty winters shall besiege thy brow,↵ And dig deep trenches in thy beauty's field,↵ Thy youth's proud livery so gazed on now,↵ Will be a tatter'd weed of small worth held:↵ Then being asked, where all thy beauty lies,↵ Where all the treasure of thy lusty days;↵ To say, within thine own deep sunken eyes,↵ Were an all-eating shame, and thriftless praise.↵ How much more praise deserv'd thy beauty's use,↵ If thou couldst answer 'This fair child of mine↵ Shall sum my count, and make my old excuse,'↵ Proving his beauty by succession thine!↵ This were to be new made when thou art old,↵ And see thy blood warm when thou feel'st it cold."
" Look in thy glass and tell the face thou viewest↵ Now is the time that face should form another;↵ Whose fresh repair if now thou not renewest,↵ Thou dost beguile the world, unbless some mother.↵ For where is she so fair whose unear'd womb↵ Disdains the tillage of thy husbandry?↵ Or who is he so fond will be the tomb,↵ Of his self-love to stop posterity?↵ Thou art thy mother's glass and she in thee↵ Calls back the lovely April of her prime;↵ So thou through windows of thine age shalt see,↵ Despite of wrinkles this thy golden time.↵ But if thou live, remember'd not to be,↵ Die single and thine image dies with thee."
" Unthrifty loveliness, why dost thou spend↵ Upon thy self thy beauty's legacy?↵ Nature's bequest gives nothing, but doth lend,↵ And being frank she lends to those are free:↵ Then, beauteous niggard, why dost thou abuse↵ The bounteous largess given thee to give?↵ Profitless usurer, why dost thou use↵ So great a sum of sums, yet canst not live?↵ For having traffic with thy self alone,↵ Thou of thy self thy sweet self dost deceive:↵ Then how when nature calls thee to be gone,↵ What acceptable audit canst thou leave?↵ Thy unused beauty must be tombed with thee,↵ Which, used, lives th' executor to be."
See also: extractFileText | extractHTMLText | readPDFFormData | tokenizedDocument