Stop word list
Words like "a", "and", and "to" (known as stop words) can add noise to data. Use this function to help create custom lists of words to remove before analysis.
To remove the default list of stop words from tokenized documents using the language details of the documents, use removeStopWords. To remove a custom list of words from tokenized documents, use removeWords.
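The difference between the two functions can be sketched as follows. This is a minimal illustration, assuming Text Analytics Toolbox is installed; the sentence and the variable names `withoutDefault` and `withoutCustom` are made up for this sketch and are not part of the original example.

```matlab
% Tokenize one short illustrative sentence (invented for this sketch).
% The language is set explicitly so that detection is not a factor.
documents = tokenizedDocument("the quick brown fox jumps over a lazy dog", ...
    'Language','en');

% Remove the default stop words for the document language.
withoutDefault = removeStopWords(documents);

% Remove only the words that are listed explicitly.
withoutCustom = removeWords(documents,["quick" "lazy"]);
```

Here removeStopWords picks the list from the language details of the documents, while removeWords removes exactly the words you pass in and nothing else.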
The function returns English, Japanese, German, and Korean stop word lists.
To remove the default list of stop words using the language details of the documents, use removeStopWords.
To remove a custom list of stop words, use the removeWords function. You can use the stop word list returned by the stopWords function as a starting point.
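Using the returned list as a starting point can also mean taking words out of it. The sketch below is one possible approach, not a method prescribed by the toolbox: it keeps the negations "no" and "not" in the text by deleting them from the list before calling removeWords. The sentence and the names `wordsToKeep` and `doc` are assumptions made for illustration.

```matlab
% Words that should stay in the documents even though they are stop words.
wordsToKeep = ["no" "not"];

% Start from the default English list and drop the words to keep.
customStopWords = stopWords;
customStopWords(ismember(customStopWords,wordsToKeep)) = [];

% Apply the reduced list to an illustrative sentence.
doc = tokenizedDocument("this is not a drill",'Language','en');
doc = removeWords(doc,customStopWords);
```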
Load the example data. The file sonnetsPreprocessed.txt contains preprocessed versions of Shakespeare's sonnets. The file contains one sonnet per line, with words separated by a space. Extract the text from sonnetsPreprocessed.txt, split the text into documents at newline characters, and then tokenize the documents.
filename = "sonnetsPreprocessed.txt";
str = extractFileText(filename);
textData = split(str,newline);
documents = tokenizedDocument(textData);
View the first few documents.
documents(1:5)
ans = 
  5x1 tokenizedDocument:

    70 tokens: fairest creatures desire increase thereby beautys rose might never die riper time decease tender heir might bear memory thou contracted thine own bright eyes feedst thy lights flame selfsubstantial fuel making famine abundance lies thy self thy foe thy sweet self cruel thou art worlds fresh ornament herald gaudy spring thine own bud buriest thy content tender churl makst waste niggarding pity world else glutton eat worlds due grave thee
    71 tokens: forty winters shall besiege thy brow dig deep trenches thy beautys field thy youths proud livery gazed tatterd weed small worth held asked thy beauty lies treasure thy lusty days say thine own deep sunken eyes alleating shame thriftless praise praise deservd thy beautys thou couldst answer fair child mine shall sum count make old excuse proving beauty succession thine new made thou art old thy blood warm thou feelst cold
    65 tokens: look thy glass tell face thou viewest time face form another whose fresh repair thou renewest thou dost beguile world unbless mother fair whose uneard womb disdains tillage thy husbandry fond tomb selflove stop posterity thou art thy mothers glass thee calls back lovely april prime thou windows thine age shalt despite wrinkles thy golden time thou live rememberd die single thine image dies thee
    71 tokens: unthrifty loveliness why dost thou spend upon thy self thy beautys legacy natures bequest gives nothing doth lend frank lends free beauteous niggard why dost thou abuse bounteous largess thee give profitless usurer why dost thou great sum sums yet canst live traffic thy self alone thou thy self thy sweet self dost deceive nature calls thee gone acceptable audit canst thou leave thy unused beauty tombed thee lives th executor
    61 tokens: hours gentle work frame lovely gaze every eye doth dwell play tyrants same unfair fairly doth excel neverresting time leads summer hideous winter confounds sap checked frost lusty leaves quite gone beauty oersnowed bareness every summers distillation left liquid prisoner pent walls glass beautys effect beauty bereft nor nor remembrance flowers distilld though winter meet leese show substance still lives sweet
Create a list of stop words starting with the output of the stopWords function.
customStopWords = [stopWords "thy" "thee" "thou" "dost" "doth"];
Remove the custom stop words from the documents and view the first few documents.
documents = removeWords(documents,customStopWords);
documents(1:5)
ans = 
  5x1 tokenizedDocument:

    62 tokens: fairest creatures desire increase thereby beautys rose might never die riper time decease tender heir might bear memory contracted thine own bright eyes feedst lights flame selfsubstantial fuel making famine abundance lies self foe sweet self cruel art worlds fresh ornament herald gaudy spring thine own bud buriest content tender churl makst waste niggarding pity world else glutton eat worlds due grave
    61 tokens: forty winters shall besiege brow dig deep trenches beautys field youths proud livery gazed tatterd weed small worth held asked beauty lies treasure lusty days say thine own deep sunken eyes alleating shame thriftless praise praise deservd beautys couldst answer fair child mine shall sum count make old excuse proving beauty succession thine new made art old blood warm feelst cold
    52 tokens: look glass tell face viewest time face form another whose fresh repair renewest beguile world unbless mother fair whose uneard womb disdains tillage husbandry fond tomb selflove stop posterity art mothers glass calls back lovely april prime windows thine age shalt despite wrinkles golden time live rememberd die single thine image dies
    52 tokens: unthrifty loveliness why spend upon self beautys legacy natures bequest gives nothing lend frank lends free beauteous niggard why abuse bounteous largess give profitless usurer why great sum sums yet canst live traffic self alone self sweet self deceive nature calls gone acceptable audit canst leave unused beauty tombed lives th executor
    59 tokens: hours gentle work frame lovely gaze every eye dwell play tyrants same unfair fairly excel neverresting time leads summer hideous winter confounds sap checked frost lusty leaves quite gone beauty oersnowed bareness every summers distillation left liquid prisoner pent walls glass beautys effect beauty bereft nor nor remembrance flowers distilld though winter meet leese show substance still lives sweet
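One way to confirm that a custom list was applied is to compare document lengths before and after removal and to look up the removed words in the vocabulary. The sketch below is self-contained and therefore uses two short phrases in the style of the sonnets rather than the sonnetsPreprocessed.txt file; the phrases and the names `lengthBefore`, `lengthAfter`, and `stillPresent` are assumptions for illustration.

```matlab
% Two short illustrative phrases, one document per element.
textData = ["thou art fair"; "thy sweet self dost deceive"];
docs = tokenizedDocument(textData,'Language','en');

% Extend the default list with archaic words, as in the example above.
customStopWords = [stopWords "thy" "thee" "thou" "dost" "doth"];

% Count tokens per document before and after removal.
lengthBefore = doclength(docs);
docs = removeWords(docs,customStopWords);
lengthAfter = doclength(docs);

% Check whether any of the added words survive in the vocabulary.
stillPresent = ismember(["thy" "thou" "dost"],docs.Vocabulary);
```

If the removal worked, every element of `stillPresent` is false and no entry of `lengthAfter` exceeds the matching entry of `lengthBefore`.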
Get a list of English stop words using the stopWords function. For readability, reshape the output.
words = stopWords;
reshape(words,[25 9])
ans = 25x9 string array
Columns 1 through 6
"a" "but" "during" "hows" "it's" "said"
"about" "by" "each" "however" "it’s" "says"
"above" "can" "either" "i" "its" "see"
"across" "can't" "for" "i'd" "let's" "she"
"after" "can’t" "from" "i’d" "let’s" "she'd"
"all" "cant" "given" "i'll" "lets" "she’d"
"along" "cannot" "had" "i’ll" "may" "shed"
"also" "could" "has" "i'm" "me" "she'll"
"am" "couldn't" "have" "i’m" "more" "she’ll"
"an" "couldn’t" "having" "im" "most" "shell"
"and" "couldnt" "he" "i've" "much" "should"
"any" "did" "he'd" "i’ve" "must" "since"
"are" "didn't" "he’d" "ive" "my" "so"
"aren't" "didn’t" "hed" "if" "no" "some"
"aren’t" "didnt" "he'll" "in" "not" "such"
"arent" "do" "he’ll" "instead" "now" "than"
"as" "does" "her" "into" "of" "that"
"at" "doesn't" "here" "is" "on" "the"
"be" "doesn’t" "hers" "isn't" "one" "their"
"because" "doesnt" "him" "isn’t" "only" "them"
"been" "doing" "himself" "isnt" "or" "then"
"before" "done" "his" "it" "other" "there"
"being" "don't" "how" "it'll" "our" "therefore"
"between" "don’t" "how's" "it’ll" "out" "these"
"both" "dont" "how’s" "itll" "over" "they"
Columns 7 through 9
"this" "we’re" "who’ve"
"those" "we've" "whove"
"through" "we’ve" "will"
"to" "weve" "with"
"too" "were" "within"
"towards" "what" "without"
"under" "what's" "won't"
"until" "what’s" "won’t"
"us" "whats" "would"
"use" "when" "wouldn't"
"used" "when's" "wouldn’t"
"uses" "when’s" "you"
"using" "whens" "you'd"
"very" "where" "you’d"
"want" "whether" "youd"
"was" "which" "you'll"
"wasn't" "while" "you’ll"
"wasn’t" "who" "youll"
"wasnt" "who'll" "you're"
"we" "who’ll" "you’re"
"we'd" "wholl" "youre"
"we’d" "who's" "you've"
"we'll" "who’s" "you’ve"
"we’ll" "whos" "youve"
"we're" "who've" "your"
Get a list of Japanese stop words using the stopWords function. For readability, reshape the output.
words = stopWords('Language','ja');
reshape([words strings(1,8)],[35 11])
ans = 35x11 string array
Columns 1 through 7
"あそこ" "さらい" "なかば" "下" "今" "地" "列"
"あたり" "さん" "なに" "字" "部" "員" "事"
"あちら" "しかた" "など" "年" "課" "線" "士"
"あっち" "しよう" "なん" "月" "係" "点" "台"
"あと" "すか" "はじめ" "日" "外" "書" "集"
"あな" "ずつ" "はず" "時" "類" "品" "様"
"あなた" "すね" "はるか" "分" "達" "力" "所"
"あれ" "すべて" "ひと" "秒" "気" "法" "歴"
"いくつ" "ぜんぶ" "ひとつ" "週" "室" "感" "器"
"いつ" "そう" "ふく" "火" "口" "作" "名"
"いま" "そこ" "ぶり" "水" "誰" "元" "情"
"いや" "そちら" "べつ" "木" "用" "手" "連"
"いろいろ" "そっち" "へん" "金" "界" "数" "毎"
"うち" "そで" "ぺん" "土" "会" "彼" "式"
"おおまか" "それ" "ほう" "国" "首" "彼女" "簿"
"おまえ" "それぞれ" "ほか" "都" "男" "子" "回"
"おれ" "それなり" "まさ" "道" "女" "内" "匹"
"がい" "たくさん" "まし" "府" "別" "楽" "個"
"かく" "たち" "まとも" "県" "話" "喜" "席"
"かたち" "たび" "まま" "市" "私" "怒" "束"
"かやの" "ため" "みたい" "区" "屋" "哀" "歳"
"から" "だめ" "みつ" "町" "店" "輪" "目"
"がら" "ちゃ" "みなさん" "村" "家" "頃" "通"
"きた" "ちゃん" "みんな" "各" "場" "化" "面"
"くせ" "てん" "もと" "第" "等" "境" "円"
"ここ" "とおり" "もの" "方" "見" "俺" "玉"
"こっち" "とき" "もん" "何" "際" "奴" "枚"
"こと" "どこ" "やつ" "的" "観" "高" "前"
"ごと" "どこか" "よう" "度" "段" "校" "後"
"こちら" "ところ" "よそ" "文" "略" "婦" "左"
Columns 8 through 11
"秋" "本当" "う" "どう"
"冬" "確か" "え" "な"
"一" "時点" "お" "ない"
"二" "全部" "か" "なり"
"三" "関係" "が" "なる"
"四" "近く" "こそ" "に"
"五" "方法" "この" "ね"
"六" "我々" "さ" "の"
"七" "違い" "さえ" "ので"
"八" "多く" "し" "のに"
"九" "扱い" "しか" "は"
"十" "新た" "する" "ばかり"
"百" "その後" "ず" "へ"
"千" "半ば" "せる" "ほど"
"万" "結局" "そして" "ます"
"億" "様々" "その" "ませ"
"兆" "以前" "た" "また"
"下記" "以後" "たい" "まで"
"上記" "以降" "ただ" "も"
"時間" "未満" "だ" "や"
"今回" "以上" "だけ" "やら"
"前回" "以下" "だに" "よ"
"場合" "幾つ" "だの" "より"
"一つ" "毎日" "ち" "れる"
"年生" "自体" "って" "わ"
"自分" "向こう" "て" "を"
"ヶ所" "何人" "で" "ん"
"ヵ所" "手段" "でし" ""
"カ所" "同じ" "です" ""
"箇所" "感じ" "では" ""
⋮
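The call above appends `strings(1,8)` because reshape needs the number of elements to equal the product of the target dimensions, and the empty strings fill the last column. Instead of hard-coding the padding, it can be computed from the list length. This is a sketch under that assumption; `numRows`, `numCols`, `numPad`, and `wordTable` are names chosen here for illustration.

```matlab
% Get the stop word list as a 1-by-N string array.
words = stopWords('Language','ja');

% Choose the number of rows and derive the number of columns needed.
numRows = 35;
numCols = ceil(numel(words)/numRows);

% Pad with empty strings so the element count fills the whole table.
numPad = numRows*numCols - numel(words);
padded = [words strings(1,numPad)];

% reshape fills column by column, so the padding ends up in the last column.
wordTable = reshape(padded,[numRows numCols]);
```

The same lines work for the other supported languages if you change the 'Language' value and pick a convenient row count.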
Get a list of German stop words using the stopWords function. For readability, reshape the output.
words = stopWords('Language','de');
reshape([words strings(1,7)],[25 8])
ans = 25x8 string array
Columns 1 through 6
"ab" "dann" "doch" "hattet" "jene" "mein"
"aber" "das" "du" "her" "jenem" "meine"
"alle" "dass" "durch" "hin" "jenen" "meinem"
"allem" "daß" "ein" "hätte" "jener" "meinen"
"allen" "dein" "eine" "hättest" "jenes" "meiner"
"aller" "deine" "einem" "hättet" "kann" "meines"
"alles" "deinem" "einen" "ich" "kannst" "mich"
"als" "deiner" "einer" "ihm" "kein" "mir"
"also" "deines" "eines" "ihn" "keine" "mit"
"am" "dem" "er" "ihr" "keinem" "muss"
"an" "den" "es" "ihre" "keinen" "musst"
"andere" "denn" "euch" "ihrem" "keiner" "musste"
"anderem" "der" "euer" "ihren" "keines" "muß"
"anderen" "derer" "eure" "ihrer" "können" "müssen"
"anderer" "des" "eurem" "ihres" "könnte" "müssten"
"anderes" "dessen" "euren" "im" "könnten" "nach"
"auch" "dich" "eures" "in" "könntest" "nicht"
"auf" "die" "für" "ins" "ließ" "nichts"
"aus" "dies" "ganz" "ist" "man" "noch"
"bei" "diese" "gar" "ja" "manche" "nun"
"bin" "diesem" "habe" "jede" "manchem" "nur"
"bis" "diesen" "haben" "jedem" "manchen" "ob"
"bist" "dieser" "hat" "jeden" "mancher" "oder"
"da" "dieses" "hatte" "jeder" "manches" "seid"
"damit" "dir" "hattest" "jedes" "mehr" "sein"
Columns 7 through 8
"seine" "welcher"
"seinem" "welches"
"seinen" "wenn"
"seiner" "wer"
"seines" "werde"
"sich" "werden"
"sie" "weshalb"
"sind" "wie"
"so" "wieder"
"um" "wieso"
"und" "wir"
"uns" "wirst"
"unter" "wo"
"vom" "während"
"von" "zu"
"vor" "zum"
"war" "zur"
"waren" "über"
"warst" ""
"warum" ""
"was" ""
"weil" ""
"welche" ""
"welchem" ""
"welchen" ""
language
Stop word language
'en' (default) | 'ja' | 'de' | 'ko'
Stop word language, specified as one of the following:
'en' – English
'ja' – Japanese
'de' – German
'ko' – Korean
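As a quick illustration of the 'Language' option, the loop below requests the list for each supported language code and stores its size. It is only a sketch: the variable names are invented here, and the counts are not quoted because they can differ between toolbox releases.

```matlab
% Language codes accepted by the 'Language' option.
languages = {'en','ja','de','ko'};

% Collect the number of stop words returned for each language.
counts = zeros(1,numel(languages));
for k = 1:numel(languages)
    words = stopWords('Language',languages{k});
    counts(k) = numel(words);
    fprintf('%s: %d stop words\n',languages{k},counts(k));
end
```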
For more information about language support in Text Analytics Toolbox™, see Language Considerations.
The stopWords and removeStopWords functions support English, Japanese, German, and Korean stop words only.
To remove stop words from other languages, use removeWords and specify your own stop words to remove.
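For a language without a built-in list, the approach described above might look like the following sketch. The French sentence and the four-word list are invented for illustration and are far from a complete French stop word list; how tokenizedDocument tokenizes text in an unsupported language is not covered by this page, so treat the result as indicative only.

```matlab
% A hand-written, deliberately tiny French stop word list (illustrative).
frenchStopWords = ["le" "la" "est" "sur"];

% Tokenize an illustrative French sentence and remove the listed words.
doc = tokenizedDocument("le chat est sur la table");
doc = removeWords(doc,frenchStopWords);
```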
bagOfNgrams
| bagOfWords
| normalizeWords
| removeLongWords
| removeShortWords
| removeStopWords
| removeWords
| tokenizedDocument