Parse HTML and Extract Text Content

This example shows how to parse HTML code and extract the text content from particular elements.

Parse HTML Code

Read the HTML code from the URL https://www.mathworks.com/help/textanalytics using webread.

url = "https://www.mathworks.com/help/textanalytics";
code = webread(url);
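
A web request can fail or hang, so a slightly more defensive variant of the read step may be useful. The following is a minimal sketch, not part of the original example: the 15-second timeout, the explicit "text" content type, and the fallback to an empty string are all assumptions chosen for illustration.

```matlab
% Sketch: read the page with an explicit timeout and basic error handling.
% The timeout value and the empty-string fallback are illustrative assumptions.
url = "https://www.mathworks.com/help/textanalytics";
options = weboptions("Timeout",15,"ContentType","text");

try
    code = webread(url,options);   % returns the HTML source as text
catch err
    warning("Could not read %s: %s",url,err.message);
    code = "";                     % fall back to an empty document
end
```

If the read fails, code is empty and the later steps return little or nothing, so check it before parsing.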

Parse the HTML code using htmlTree.

tree = htmlTree(code);

View the name of the root HTML element of the tree.

tree.Name

ans = 
"HTML"

View the child elements of the tree. The child elements are subtrees of tree.

tree.Children

ans = 
  4×1 htmlTree:

    " "
    <HEAD><TITLE>Text Analytics Toolbox Documentation</TITLE><META charset="utf-8"/><META content="width=device-width, initial-scale=1.0" name="viewport"/><META content="IE=edge" http-equiv="X-UA-Compatible"/><LINK href="/includes_content/responsive/css/bootstrap/bootstrap.min.css" rel="stylesheet" type="text/css"/><LINK href="/includes_content/responsive/css/site6.css?20180314" rel="stylesheet" type="text/css"/><LINK href="/includes_content/responsive/css/site6_lg.css?20180314" media="screen and (min-width: 1200px)" rel="stylesheet"/><LINK href="/includes_content/responsive/css/site6_md.css?20180314" media="screen and (min-width: 992px) and (max-width: 1199px)" rel="stylesheet"/><LINK href="/includes_content/responsive/css/site6_sm+xs.css?20180314" media="screen and (max-width: 991px)" rel="stylesheet"/><LINK href="/includes_content/responsive/css/site6_sm.css?20180314" media="screen and (min-width: 768px) and (max-width: 991px)" rel="stylesheet"/><LINK href="/includes_content/responsive/…
    " "
    <BODY id="responsive_offcanvas"><!-- Mobile TopNav: Start --><DIV class="header visible-xs visible-sm" id="header_mobile" translate="no"><NAV class="navbar navbar-default" role="navigation"><DIV class="container-fluid"><DIV class="row"><DIV class="col-xs-12"><DIV class="navbar-header"><BUTTON class="navbar-toggle topnav_toggle" data-target="#topnav_collapse" data-toggle="collapse" type="button"><SPAN class="sr-only">Toggle Main Navigation</SPAN><SPAN class="icon-menu"/></BUTTON><A class="svg_link navbar-brand" href="https://www.mathworks.com?s_tid=gn_logo"><IMG alt="MathWorks" class="mw_logo" src="/images/responsive/global/pic-header-mathworks-logo.svg"/></A></DIV></DIV></DIV><DIV class="row visible-xs visible-sm"><DIV class="col-xs-12"><DIV class="navbar-collapse collapse" id="topnav_collapse"><UL class="nav navbar-nav" id="topnav"><LI class="headernav_login"><A class="mwa-nav_login" href="https://www.mathworks.com/login?uri=http://www.mathworks.com/help/textanalytics/index.html">Sign…

Extract Text from HTML Tree

To extract text directly from the HTML tree, use extractHTMLText.

str = extractHTMLText(tree)

str = 
    "Text Analytics Toolbox™ provides algorithms and visualizations for preprocessing, analyzing, and modeling text data. Models created with the toolbox can be used in applications such as sentiment analysis, predictive maintenance, and topic modeling.
     
     Text Analytics Toolbox includes tools for processing raw text from sources such as equipment logs, news feeds, surveys, operator reports, and social media. You can extract text from popular file formats, preprocess raw text, extract individual words, convert text into numerical representations, and build statistical models.
     
     Using machine learning techniques such as LSA, LDA, and word embeddings, you can find clusters and create features from high-dimensional text datasets. Features created with Text Analytics Toolbox can be combined with features from other data sources to build machine learning models that take advantage of textual, numeric, and other types of data."
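
To try the extraction step without a network connection, the same calls can be run on a small HTML string. This is a sketch under assumptions: the HTML snippet is invented for illustration and does not come from the page read above.

```matlab
% Sketch: extract text from a small, hand-written HTML document.
% The HTML content here is a made-up example.
code = "<html><body><h1>Title</h1>" + ...
    "<p>First paragraph.</p><p>Second paragraph.</p>" + ...
    "</body></html>";

tree = htmlTree(code);              % parse the string into an HTML tree
strFromTree = extractHTMLText(tree) % text extracted from the parsed tree
strFromCode = extractHTMLText(code) % extractHTMLText also accepts HTML code directly
```

Parsing once with htmlTree is the better choice when the same document is queried several times, as in the following sections.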

Find HTML Elements

To find particular elements of an HTML tree, use findElement. Find all the hyperlinks in the HTML tree. In HTML, hyperlinks use the "A" tag.

selector = "A";
subtrees = findElement(tree,selector);

View the first few subtrees.

subtrees(1:20)

ans = 
  20×1 htmlTree:

    <A class="svg_link navbar-brand" href="https://www.mathworks.com?s_tid=gn_logo"><IMG alt="MathWorks" class="mw_logo" src="/images/responsive/global/pic-header-mathworks-logo.svg"/></A>
    <A class="mwa-nav_login" href="https://www.mathworks.com/login?uri=http://www.mathworks.com/help/textanalytics/index.html">Sign In</A>
    <A href="https://www.mathworks.com/products.html?s_tid=gn_ps">Products</A>
    <A href="https://www.mathworks.com/solutions.html?s_tid=gn_sol">Solutions</A>
    <A href="https://www.mathworks.com/academia.html?s_tid=gn_acad">Academia</A>
    <A href="https://www.mathworks.com/support.html?s_tid=gn_supp">Support</A>
    <A href="https://www.mathworks.com/matlabcentral/?s_tid=gn_mlc">Community</A>
    <A href="https://www.mathworks.com/company/events.html?s_tid=gn_ev">Events</A>
    <A href="https://www.mathworks.com/company/aboutus/contact_us.html?s_tid=gn_cntus">Contact Us</A>
    <A href="https://www.mathworks.com/store?s_cid=store_top_nav&amp;s_tid=gn_store">How to Buy</A>
    <A href="https://www.mathworks.com/company/aboutus/contact_us.html?s_tid=gn_cntus">Contact Us</A>
    <A href="https://www.mathworks.com/store?s_cid=store_top_nav&amp;s_tid=gn_store">How to Buy</A>
    <A class="mwa-nav_login" href="https://www.mathworks.com/login?uri=http://www.mathworks.com/help/textanalytics/index.html">Sign In</A>
    <A class="svg_link pull-left" href="https://www.mathworks.com?s_tid=gn_logo"><IMG alt="MathWorks" class="mw_logo" src="/images/responsive/global/pic-header-mathworks-logo.svg"/></A>
    <A href="https://www.mathworks.com/products.html?s_tid=gn_ps">Products</A>
    <A href="https://www.mathworks.com/solutions.html?s_tid=gn_sol">Solutions</A>
    <A href="https://www.mathworks.com/academia.html?s_tid=gn_acad">Academia</A>
    <A href="https://www.mathworks.com/support.html?s_tid=gn_supp">Support</A>
    <A href="https://www.mathworks.com/matlabcentral/?s_tid=gn_mlc">Community</A>
    <A href="https://www.mathworks.com/company/events.html?s_tid=gn_ev">Events</A>

Create a word cloud from the text of the hyperlinks.

str = extractHTMLText(subtrees);
figure
wordcloud(str);
title("Hyperlinks")

Get HTML Attributes

Get the class attributes from the paragraph elements in the HTML tree. Elements without a class attribute return <missing>.

subtrees = findElement(tree,'p');
attr = "class";
str = getAttribute(subtrees,attr)

str = 21×1 string array
    <missing>
    <missing>
    "add_margin_5"
    <missing>
    <missing>
    <missing>
    <missing>
    <missing>
    "category_desc"
    "category_desc"
    "category_desc"
    "category_desc"
    <missing>
    <missing>
    <missing>
    "text-center"
    <missing>
    <missing>
    <missing>
    "copyright"
    <missing>

Create a word cloud from the text contained in the paragraph elements that have the class "category_desc".

subtrees = findElement(tree,'p.category_desc');
str = extractHTMLText(subtrees);
figure
wordcloud(str);
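
The attribute and selector steps above can be combined on a small, self-contained document. The following is a sketch only: the HTML snippet, the class name "intro", and the variable names are hypothetical, and it assumes that htmlTree arrays support logical indexing in the same way as the numeric indexing shown earlier.

```matlab
% Sketch: read attributes and filter elements in a made-up HTML snippet.
code = "<html><body>" + ...
    "<p class=""intro"">See the <a href=""https://www.mathworks.com"">home page</a>.</p>" + ...
    "<p>No class here.</p>" + ...
    "</body></html>";
tree = htmlTree(code);

% Class attribute of every paragraph; paragraphs without one give <missing>.
paragraphs = findElement(tree,"p");
classes = getAttribute(paragraphs,"class");

% Keep only the paragraphs that have a class attribute.
hasClass = ~ismissing(classes);
classedParagraphs = paragraphs(hasClass);

% Select paragraphs of one class directly with an element.class selector.
intro = findElement(tree,"p.intro");
introText = extractHTMLText(intro)

% The same getAttribute call returns link targets from hyperlink elements.
links = findElement(tree,"a");
href = getAttribute(links,"href")
```

Filtering with ismissing helps when the class names are not known in advance. The element.class selector is shorter when they are.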
