{"id":26993,"date":"2025-04-15T14:56:29","date_gmt":"2025-04-15T07:56:29","guid":{"rendered":"https:\/\/interdata.vn\/blog\/?p=26993"},"modified":"2025-04-15T14:56:29","modified_gmt":"2025-04-15T07:56:29","slug":"nltk-la-gi","status":"publish","type":"post","link":"https:\/\/interdata.vn\/blog\/nltk-la-gi\/","title":{"rendered":"What Is NLTK? An A-Z Guide to the Natural Language Toolkit Library"},"content":{"rendered":"<div id=\"ez-toc-container\" class=\"ez-toc-v2_0_85 counter-hierarchy ez-toc-counter ez-toc-white ez-toc-container-direction\">\n<div class=\"ez-toc-title-container\">\n<p class=\"ez-toc-title\" style=\"cursor:inherit\">CONTENTS<\/p>\n<\/div>\n<nav><ul class='ez-toc-list ez-toc-list-level-1 eztoc-toggle-hide-by-default' ><li class='ez-toc-page-1 ez-toc-heading-level-2'><a class=\"ez-toc-link ez-toc-heading-1\" 
href=\"https:\/\/interdata.vn\/blog\/nltk-la-gi\/#NLTK-la-gi\" >What is NLTK?<\/a><\/li><li class='ez-toc-page-1 ez-toc-heading-level-2'><a class=\"ez-toc-link ez-toc-heading-2\" href=\"https:\/\/interdata.vn\/blog\/nltk-la-gi\/#Mot-so-tinh-nang-chinh-cua-NLTK\" >Key features of NLTK<\/a><ul class='ez-toc-list-level-3' ><li class='ez-toc-heading-level-3'><a class=\"ez-toc-link ez-toc-heading-3\" href=\"https:\/\/interdata.vn\/blog\/nltk-la-gi\/#Tokenization-Phan-manh-van-ban\" >Tokenization<\/a><\/li><li class='ez-toc-page-1 ez-toc-heading-level-3'><a class=\"ez-toc-link ez-toc-heading-4\" href=\"https:\/\/interdata.vn\/blog\/nltk-la-gi\/#Parsing-Phan-tich-cu-phap\" >Parsing<\/a><\/li><li class='ez-toc-page-1 ez-toc-heading-level-3'><a class=\"ez-toc-link ez-toc-heading-5\" href=\"https:\/\/interdata.vn\/blog\/nltk-la-gi\/#Part-of-Speech-Tagging-Gan-the-tu-loai\" >Part-of-Speech Tagging<\/a><\/li><li class='ez-toc-page-1 ez-toc-heading-level-3'><a class=\"ez-toc-link ez-toc-heading-6\" href=\"https:\/\/interdata.vn\/blog\/nltk-la-gi\/#Text-Summarization-Tom-tat-van-ban\" >Text Summarization<\/a><\/li><li class='ez-toc-page-1 ez-toc-heading-level-3'><a class=\"ez-toc-link ez-toc-heading-7\" href=\"https:\/\/interdata.vn\/blog\/nltk-la-gi\/#Lemmatization-Chuyen-doi-tu-ve-co-ban\" >Lemmatization<\/a><\/li><li class='ez-toc-page-1 ez-toc-heading-level-3'><a class=\"ez-toc-link ez-toc-heading-8\" href=\"https:\/\/interdata.vn\/blog\/nltk-la-gi\/#Text-Classification-Phan-loai-van-ban\" >Text Classification<\/a><\/li><li class='ez-toc-page-1 ez-toc-heading-level-3'><a class=\"ez-toc-link ez-toc-heading-9\" 
href=\"https:\/\/interdata.vn\/blog\/nltk-la-gi\/#Named-Entity-Recognition-Nhan-dien-thuc-the-ten\" >Named Entity Recognition<\/a><\/li><\/ul><\/li><li class='ez-toc-page-1 ez-toc-heading-level-2'><a class=\"ez-toc-link ez-toc-heading-10\" href=\"https:\/\/interdata.vn\/blog\/nltk-la-gi\/#Natural-Language-Toolkit-su-dung-cong-cu-nao\" >Which tools does the Natural Language Toolkit use?<\/a><ul class='ez-toc-list-level-3' ><li class='ez-toc-heading-level-3'><a class=\"ez-toc-link ez-toc-heading-11\" href=\"https:\/\/interdata.vn\/blog\/nltk-la-gi\/#NLTK-Data\" >NLTK Data<\/a><\/li><li class='ez-toc-page-1 ez-toc-heading-level-3'><a class=\"ez-toc-link ez-toc-heading-12\" href=\"https:\/\/interdata.vn\/blog\/nltk-la-gi\/#NLTK-Corpora\" >NLTK Corpora<\/a><\/li><li class='ez-toc-page-1 ez-toc-heading-level-3'><a class=\"ez-toc-link ez-toc-heading-13\" href=\"https:\/\/interdata.vn\/blog\/nltk-la-gi\/#NLTK-Models\" >NLTK Models<\/a><\/li><\/ul><\/li><li class='ez-toc-page-1 ez-toc-heading-level-2'><a class=\"ez-toc-link ez-toc-heading-14\" href=\"https:\/\/interdata.vn\/blog\/nltk-la-gi\/#Mot-so-ung-dung-noi-bat-cua-thu-vien-NLTK\" >Notable applications of the NLTK library<\/a><ul class='ez-toc-list-level-3' ><li class='ez-toc-heading-level-3'><a class=\"ez-toc-link ez-toc-heading-15\" href=\"https:\/\/interdata.vn\/blog\/nltk-la-gi\/#-Giao-duc-va-Nghien-cuu-NLP\" >NLP education and research<\/a><\/li><li class='ez-toc-page-1 ez-toc-heading-level-3'><a class=\"ez-toc-link ez-toc-heading-16\" href=\"https:\/\/interdata.vn\/blog\/nltk-la-gi\/#-Tien-xu-ly-Van-ban-cho-Hoc-may\" >Text preprocessing for machine learning<\/a><\/li><li class='ez-toc-page-1 ez-toc-heading-level-3'><a class=\"ez-toc-link ez-toc-heading-17\" 
href=\"https:\/\/interdata.vn\/blog\/nltk-la-gi\/#-Phan-tich-va-Kham-pha-du-lieu-Van-ban\" >Text data analysis and exploration<\/a><\/li><li class='ez-toc-page-1 ez-toc-heading-level-3'><a class=\"ez-toc-link ez-toc-heading-18\" href=\"https:\/\/interdata.vn\/blog\/nltk-la-gi\/#-Xay-dung-Nguyen-mau-he-thong-NLP\" >Prototyping NLP systems<\/a><\/li><li class='ez-toc-page-1 ez-toc-heading-level-3'><a class=\"ez-toc-link ez-toc-heading-19\" href=\"https:\/\/interdata.vn\/blog\/nltk-la-gi\/#-Phat-trien-cong-cu-ngon-ngu-hoc\" >Developing linguistic tools<\/a><\/li><li class='ez-toc-page-1 ez-toc-heading-level-3'><a class=\"ez-toc-link ez-toc-heading-20\" href=\"https:\/\/interdata.vn\/blog\/nltk-la-gi\/#Ho-tro-tac-vu-Hoc-may-co-ban\" >Support for basic machine learning tasks<\/a><\/li><\/ul><\/li><\/ul><\/nav><\/div>\n<p>Are you learning about NLTK (Natural Language Toolkit) and its role in natural language processing (NLP)? This article explains in detail <a href=\"https:\/\/interdata.vn\/blog\/nltk-la-gi\/\"><strong>what NLTK is<\/strong><\/a>, covers the core features of the Natural Language Toolkit, and looks at notable applications of this Python library, particularly in education and research. 
Read on!<\/p>\n<h2><span class=\"ez-toc-section\" id=\"NLTK-la-gi\"><\/span>What is NLTK?<span class=\"ez-toc-section-end\"><\/span><\/h2>\n<p><strong>NLTK (short for Natural Language Toolkit) is a leading <a href=\"https:\/\/interdata.vn\/blog\/open-source-la-gi\/\">open-source<\/a> suite of libraries and programs for the Python <a href=\"https:\/\/interdata.vn\/blog\/ngon-ngu-lap-trinh-la-gi\/\">programming language<\/a><\/strong>. It is designed specifically to support tasks in Natural Language Processing (NLP).<\/p>\n<p>NLTK was originally developed by Steven Bird and colleagues at the University of Pennsylvania. 
The core goal was to build a powerful yet accessible tool, primarily for teaching and research in NLP and computational linguistics.<\/p>\n<figure id=\"attachment_27005\" aria-describedby=\"caption-attachment-27005\" style=\"width: 800px\" class=\"wp-caption aligncenter\"><img loading=\"lazy\" decoding=\"async\" src=\"https:\/\/interdata.vn\/blog\/wp-content\/uploads\/2025\/04\/Natural-Language-Toolkit-NLTK-la-gi-.png\" alt=\"What is NLTK (Natural Language Toolkit)?\" width=\"800\" height=\"500\" class=\"size-full wp-image-27005\" title=\"\" srcset=\"https:\/\/interdata.vn\/blog\/wp-content\/uploads\/2025\/04\/Natural-Language-Toolkit-NLTK-la-gi-.png 800w, https:\/\/interdata.vn\/blog\/wp-content\/uploads\/2025\/04\/Natural-Language-Toolkit-NLTK-la-gi--300x188.png 300w, https:\/\/interdata.vn\/blog\/wp-content\/uploads\/2025\/04\/Natural-Language-Toolkit-NLTK-la-gi--768x480.png 768w, https:\/\/interdata.vn\/blog\/wp-content\/uploads\/2025\/04\/Natural-Language-Toolkit-NLTK-la-gi--750x469.png 750w\" sizes=\"auto, (max-width: 800px) 100vw, 800px\" \/><figcaption id=\"caption-attachment-27005\" class=\"wp-caption-text\">What is NLTK (Natural Language Toolkit)?<\/figcaption><\/figure>\n<p>As a project whose <a href=\"https:\/\/interdata.vn\/blog\/source-code-la-gi\/\">source code<\/a> is fully open, the NLTK library is currently licensed under the <a href=\"https:\/\/interdata.vn\/blog\/apache-la-gi\/\">Apache<\/a> License 2.0. 
This lets anyone freely use, copy, modify, and redistribute the library without paying licensing fees, for both academic and commercial purposes (the bundled corpora are distributed under their own terms).<\/p>\n<p>NLTK provides the basic building blocks for analyzing human language. It includes modules for foundational tasks such as tokenization, part-of-speech tagging, and parsing, helping users understand and practice core NLP techniques.<\/p>\n<p>A particular strength of NLTK is its easy access to more than 50 corpora and lexical resources, such as WordNet. 
This is especially useful for learning, research, and experimenting with <a href=\"https:\/\/interdata.vn\/blog\/thuat-toan-algorithm\/\">algorithms<\/a>.<\/p>\n<h2><span class=\"ez-toc-section\" id=\"Mot-so-tinh-nang-chinh-cua-NLTK\"><\/span><strong>Key features of NLTK<\/strong><span class=\"ez-toc-section-end\"><\/span><\/h2>\n<p>NLTK provides a rich set of tools for many natural language processing tasks, from basic to more advanced. These features serve as foundational components that let users analyze, understand, and manipulate text data.<\/p>\n<h3><span class=\"ez-toc-section\" id=\"Tokenization-Phan-manh-van-ban\"><\/span><strong>Tokenization<\/strong><span class=\"ez-toc-section-end\"><\/span><\/h3>\n<p>Tokenization is the most basic step: it breaks text into meaningful units called tokens. NLTK provides tokenizers that split text into sentences, or into individual words and punctuation marks, in preparation for deeper analysis.<\/p>\n<p>For example, the Vietnamese sentence &#8220;NLTK r\u1ea5t h\u1eefu \u00edch.&#8221; (&#8220;NLTK is very useful.&#8221;) can be split into the list of word tokens <code>['NLTK', 'r\u1ea5t', 'h\u1eefu', '\u00edch', '.']<\/code>. 
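<\/p>\n<p>As a minimal sketch of the example above, assuming NLTK is installed: <code>wordpunct_tokenize<\/code> is a regular-expression tokenizer that needs no extra data download, whereas the more commonly used <code>word_tokenize<\/code> and <code>sent_tokenize<\/code> also require the Punkt data package. The variable names are illustrative.<\/p>\n

```python
from nltk.tokenize import wordpunct_tokenize

# Example sentence from the text ("NLTK is very useful." in Vietnamese),
# written with Unicode escapes so the source stays ASCII-only.
sentence = "NLTK r\u1ea5t h\u1eefu \u00edch."

# wordpunct_tokenize uses a regular expression to separate runs of
# word characters from runs of punctuation, dropping whitespace.
tokens = wordpunct_tokenize(sentence)
print(tokens)
```

<p>For real Vietnamese text a dedicated word segmenter would usually be preferable, because many Vietnamese words consist of several space-separated syllables.<\/p>\n<p>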
This is an essential preprocessing step in most NLP applications, because it lets the computer work with individual linguistic units.<\/p>\n<figure id=\"attachment_27011\" aria-describedby=\"caption-attachment-27011\" style=\"width: 800px\" class=\"wp-caption aligncenter\"><img loading=\"lazy\" decoding=\"async\" src=\"https:\/\/interdata.vn\/blog\/wp-content\/uploads\/2025\/04\/Mot-so-tinh-nang-chinh-cua-NLTK.jpg\" alt=\"Key features of NLTK\" width=\"800\" height=\"500\" class=\"size-full wp-image-27011\" title=\"\" srcset=\"https:\/\/interdata.vn\/blog\/wp-content\/uploads\/2025\/04\/Mot-so-tinh-nang-chinh-cua-NLTK.jpg 800w, https:\/\/interdata.vn\/blog\/wp-content\/uploads\/2025\/04\/Mot-so-tinh-nang-chinh-cua-NLTK-300x188.jpg 300w, https:\/\/interdata.vn\/blog\/wp-content\/uploads\/2025\/04\/Mot-so-tinh-nang-chinh-cua-NLTK-768x480.jpg 768w, https:\/\/interdata.vn\/blog\/wp-content\/uploads\/2025\/04\/Mot-so-tinh-nang-chinh-cua-NLTK-750x469.jpg 750w\" sizes=\"auto, (max-width: 800px) 100vw, 800px\" \/><figcaption id=\"caption-attachment-27011\" class=\"wp-caption-text\">Key features of NLTK<\/figcaption><\/figure>\n<h3><span class=\"ez-toc-section\" id=\"Parsing-Phan-tich-cu-phap\"><\/span><strong>Parsing<\/strong><span class=\"ez-toc-section-end\"><\/span><\/h3>\n<p>Parsing, or syntactic analysis, is the process of determining the grammatical structure of a sentence. 
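<\/p>\n<p>To make this concrete, here is a small sketch that parses one sentence with a toy context-free grammar, assuming NLTK is installed. The grammar and the sentence are invented for illustration; <code>nltk.CFG.fromstring<\/code> and <code>nltk.ChartParser<\/code> come from NLTK itself and need no data download.<\/p>\n

```python
import nltk

# A toy context-free grammar, invented purely for illustration.
grammar = nltk.CFG.fromstring("""
S -> NP VP
NP -> Det N
VP -> V NP
Det -> 'the'
N -> 'dog' | 'cat'
V -> 'chased'
""")

tokens = "the dog chased the cat".split()

# ChartParser yields every parse tree the grammar licenses for the tokens.
parser = nltk.ChartParser(grammar)
trees = list(parser.parse(tokens))

for tree in trees:
    print(tree)  # bracketed form of the parse tree
```

<p>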
NLTK provides parsing algorithms and sample grammars for building parse trees, which represent the hierarchical and dependency relationships between the words and phrases in a sentence.<\/p>\n<p>Understanding syntactic structure is important for interpreting a sentence's meaning accurately. Parsing is used in grammar checkers, machine translation, and question-answering systems that need a deep understanding of the structure of questions and answers.<\/p>\n<h3><span class=\"ez-toc-section\" id=\"Part-of-Speech-Tagging-Gan-the-tu-loai\"><\/span><strong>Part-of-Speech Tagging<\/strong><span class=\"ez-toc-section-end\"><\/span><\/h3>\n<p>Part-of-speech (POS) tagging assigns a word-class label (such as noun, verb, adjective, or adverb) to each word in a text. NLTK includes taggers based on statistical models that do this automatically and with reasonable accuracy.<\/p>\n<p>For example, the word &#8220;fly&#8221; can be a verb (to travel through the air) or a noun (the insect). POS tagging helps clarify which sense a word has in a specific context. 
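<\/p>\n<p>To illustrate how context can resolve the &#8220;fly&#8221; ambiguity, here is a minimal sketch of a statistical tagger trained on two hand-made sentences, assuming NLTK is installed. The training data and tag choices are assumptions made for illustration, not a real corpus. The sketch chains <code>BigramTagger<\/code>, <code>UnigramTagger<\/code>, and <code>DefaultTagger<\/code> through the <code>backoff<\/code> parameter; in practice one would more likely call <code>nltk.pos_tag<\/code>, which relies on a pre-trained model from NLTK Data.<\/p>\n

```python
from nltk.tag import BigramTagger, DefaultTagger, UnigramTagger

# Tiny hand-made training set with Penn Treebank-style tags (illustrative only).
train_sents = [
    [("a", "DT"), ("fly", "NN"), ("buzzed", "VBD")],
    [("birds", "NNS"), ("fly", "VBP"), ("south", "RB")],
]

# Back off from bigram context to single-word statistics, then to a default tag.
default_tagger = DefaultTagger("NN")
unigram_tagger = UnigramTagger(train_sents, backoff=default_tagger)
bigram_tagger = BigramTagger(train_sents, backoff=unigram_tagger)

# The same word receives a different tag depending on the preceding tag.
noun_reading = bigram_tagger.tag(["a", "fly", "buzzed"])
verb_reading = bigram_tagger.tag(["birds", "fly", "south"])
print(noun_reading)
print(verb_reading)
```

<p>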
It is an important input to more complex NLP tasks such as lemmatization and named entity recognition.<\/p>\n<h3><span class=\"ez-toc-section\" id=\"Text-Summarization-Tom-tat-van-ban\"><\/span><strong>Text Summarization<\/strong><span class=\"ez-toc-section-end\"><\/span><\/h3>\n<p>NLTK provides foundational tools that can be used to build simple text summarization systems. These usually take an extractive approach, for example selecting the most important sentences based on word frequency or sentence position.<\/p>\n<p>The library does not ship a ready-made, single-call summarization function. Instead, it supplies building blocks (tokenizers, stop-word lists, frequency counts) from which users can compute sentence scores and implement extractive summarization logic suited to their needs.<\/p>\n<h3><span class=\"ez-toc-section\" id=\"Lemmatization-Chuyen-doi-tu-ve-co-ban\"><\/span><strong>Lemmatization<\/strong><span class=\"ez-toc-section-end\"><\/span><\/h3>\n<p>Lemmatization normalizes words by reducing them to their dictionary form (lemma). 
Unlike stemming, which only strips affixes by rule, lemmatization uses a lexical database (such as WordNet) and takes the part of speech into account, so that the returned base form is a real, meaningful word.<\/p>\n<p>For example, &#8220;am&#8221;, &#8220;is&#8221;, and &#8220;are&#8221; are reduced to &#8220;be&#8221;; &#8220;running&#8221; becomes &#8220;run&#8221;; and &#8220;better&#8221; becomes &#8220;good&#8221;, provided the lemmatizer is told the correct part of speech (verb or adjective). This normalization improves tasks such as information retrieval and topic modeling.<\/p>\n<h3><span class=\"ez-toc-section\" id=\"Text-Classification-Phan-loai-van-ban\"><\/span><strong>Text Classification<\/strong><span class=\"ez-toc-section-end\"><\/span><\/h3>\n<p>Text classification automatically assigns predefined labels or categories to a piece of text. NLTK supports this with tools for extracting features from text and with built-in implementations of basic machine learning algorithms such as Naive Bayes.<\/p>\n<p>Users can use the Natural Language Toolkit to prepare data and build simple classification models. 
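<\/p>\n<p>A minimal sketch of such a model, assuming NLTK is installed: a Naive Bayes spam filter trained on six hand-written messages. The training data and the <code>bag_of_words<\/code> helper are hypothetical and only illustrate the feature-dictionary format that <code>nltk.NaiveBayesClassifier<\/code> expects.<\/p>\n

```python
import nltk


def bag_of_words(text):
    """Turn a text into the feature dictionary NLTK classifiers expect."""
    return {word: True for word in text.lower().split()}


# Hand-made labelled examples (hypothetical data, for illustration only).
labelled_texts = [
    ("win a free prize now", "spam"),
    ("free money win big", "spam"),
    ("claim your free prize", "spam"),
    ("meeting at noon tomorrow", "ham"),
    ("see you at lunch", "ham"),
    ("project report due tomorrow", "ham"),
]
train_set = [(bag_of_words(text), label) for text, label in labelled_texts]

# Train the classifier on (features, label) pairs.
classifier = nltk.NaiveBayesClassifier.train(train_set)

spam_guess = classifier.classify(bag_of_words("free prize"))
ham_guess = classifier.classify(bag_of_words("lunch tomorrow"))
print(spam_guess, ham_guess)
```

<p>A data set this small only demonstrates the API; a usable classifier needs far more labelled examples and a held-out test set.<\/p>\n<p>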
Common applications include spam filtering, sentiment analysis (deciding whether a text is positive, negative, or neutral), and classifying news articles by topic.<\/p>\n<h3><span class=\"ez-toc-section\" id=\"Named-Entity-Recognition-Nhan-dien-thuc-the-ten\"><\/span><strong>Named Entity Recognition<\/strong><span class=\"ez-toc-section-end\"><\/span><\/h3>\n<p>Named entity recognition (NER) identifies named entities in text and classifies them into types such as people, organizations, locations, dates, and monetary amounts. The Natural Language Toolkit provides pre-trained tools for basic NER.<\/p>\n<p>This feature is very useful for automatic information extraction. 
For example, a system can scan news articles to automatically find and list every company or place name mentioned, in support of building a knowledge base or market analysis.<\/p>\n<h2><span class=\"ez-toc-section\" id=\"Natural-Language-Toolkit-su-dung-cong-cu-nao\"><\/span><strong>Which tools does the Natural Language Toolkit use?<\/strong><span class=\"ez-toc-section-end\"><\/span><\/h2>\n<p>With NLTK's features covered, let us look at the tools it relies on. To perform complex natural language processing functions, NLTK does not depend on the core library code alone. It also uses and provides an ecosystem of supporting &#8220;tools&#8221;, including data packages, corpora, and pre-trained models.<\/p>\n<h3><span class=\"ez-toc-section\" id=\"NLTK-Data\"><\/span><strong>NLTK Data<\/strong><span class=\"ez-toc-section-end\"><\/span><\/h3>\n<p>NLTK Data is the collective name for all the additional resources the library needs to function fully. 
These resources are not installed with the core library by default, which keeps NLTK lightweight. Users download them with the <code>nltk.download()<\/code> utility.<\/p>\n<p>You can download only the packages you need. For example, you might need the <code>punkt<\/code> package for sentence splitting (newer NLTK releases may ask for <code>punkt_tab<\/code> instead), the <code>stopwords<\/code> package containing lists of common stop words, or other specific models and corpora.<\/p>\n<h3><span class=\"ez-toc-section\" id=\"NLTK-Corpora\"><\/span><strong>NLTK Corpora<\/strong><span class=\"ez-toc-section-end\"><\/span><\/h3>\n<p>NLTK Corpora are an especially valuable part of NLTK Data. They are large collections of real-world text or speech data, often accompanied by linguistic annotations such as part-of-speech tags or syntactic parses.<\/p>\n<p>The library provides a convenient interface to more than 50 well-known corpora. 
Examples include a selection of literary texts from Project <a href=\"https:\/\/interdata.vn\/blog\/gutenberg-la-gi\/\">Gutenberg<\/a>, the multi-genre Brown Corpus, a sample of the syntactically annotated Penn Treebank, and the WordNet lexical database, all of which are valuable for research and learning.<\/p>\n<h3><span class=\"ez-toc-section\" id=\"NLTK-Models\"><\/span><strong>NLTK Models<\/strong><span class=\"ez-toc-section-end\"><\/span><\/h3>\n<p>NLTK Models are pre-trained models that perform specific NLP tasks without requiring users to train them from scratch. They are also part of NLTK Data and must be downloaded with <code>nltk.download()<\/code> before use.<\/p>\n<p>Common examples are the models used for part-of-speech tagging (POS tagger) and for named entity recognition (NER tagger). 
These models are typically trained on large corpora, so users can quickly apply these NLP techniques in their own applications.<\/p>\n<h2><span class=\"ez-toc-section\" id=\"Mot-so-ung-dung-noi-bat-cua-thu-vien-NLTK\"><\/span><strong>Notable applications of the NLTK library<\/strong><span class=\"ez-toc-section-end\"><\/span><\/h2>\n<p>With its foundational NLP tools and rich language resources, NLTK is widely used in real-world projects. It is especially common and effective in education, language research, and the early stages of developing text-processing systems.<\/p>\n<p>Below are some applications of the Natural Language Toolkit:<\/p>\n<h3><span class=\"ez-toc-section\" id=\"-Giao-duc-va-Nghien-cuu-NLP\"><\/span><strong>NLP education and research<\/strong><span class=\"ez-toc-section-end\"><\/span><\/h3>\n<p>This is where NLTK is strongest. The library provides an ideal environment for teaching and learning natural language processing concepts. 
Students can easily practice tokenization, part-of-speech tagging, parsing, and more, which reinforces the theory.<\/p>\n<p>Researchers also use NLTK regularly: to quickly try out new algorithms, to analyze existing corpora, or to build specialized language-analysis tools for their own scientific work.<\/p>\n<figure id=\"attachment_27012\" aria-describedby=\"caption-attachment-27012\" style=\"width: 800px\" class=\"wp-caption aligncenter\"><img loading=\"lazy\" decoding=\"async\" src=\"https:\/\/interdata.vn\/blog\/wp-content\/uploads\/2025\/04\/Mot-so-ung-dung-noi-bat-cua-thu-vien-NLTK.jpg\" alt=\"Notable applications of the NLTK library\" width=\"800\" height=\"500\" class=\"size-full wp-image-27012\" title=\"\" srcset=\"https:\/\/interdata.vn\/blog\/wp-content\/uploads\/2025\/04\/Mot-so-ung-dung-noi-bat-cua-thu-vien-NLTK.jpg 800w, https:\/\/interdata.vn\/blog\/wp-content\/uploads\/2025\/04\/Mot-so-ung-dung-noi-bat-cua-thu-vien-NLTK-300x188.jpg 300w, https:\/\/interdata.vn\/blog\/wp-content\/uploads\/2025\/04\/Mot-so-ung-dung-noi-bat-cua-thu-vien-NLTK-768x480.jpg 768w, https:\/\/interdata.vn\/blog\/wp-content\/uploads\/2025\/04\/Mot-so-ung-dung-noi-bat-cua-thu-vien-NLTK-750x469.jpg 750w\" sizes=\"auto, (max-width: 800px) 100vw, 800px\" \/><figcaption id=\"caption-attachment-27012\" class=\"wp-caption-text\">Notable applications of the NLTK library<\/figcaption><\/figure>\n<h3><span class=\"ez-toc-section\" 
id=\"-Tien-xu-ly-Van-ban-cho-Hoc-may\"><\/span><strong>Text preprocessing for machine learning<\/strong><span class=\"ez-toc-section-end\"><\/span><\/h3>\n<p>A very important application is preprocessing raw text data. Text needs to be cleaned and normalized before it is fed to machine learning models. NLTK provides tools for this, such as tokenization, stop-word removal, and word normalization (stemming\/lemmatization).<\/p>\n<h3><span class=\"ez-toc-section\" id=\"-Phan-tich-va-Kham-pha-du-lieu-Van-ban\"><\/span><strong>Text data analysis and exploration<\/strong><span class=\"ez-toc-section-end\"><\/span><\/h3>\n<p>NLTK is very useful for exploratory analysis of text data. 
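<\/p>\n<p>A short sketch of this kind of exploration, assuming NLTK is installed. It counts word and word-pair frequencies with <code>FreqDist<\/code> and <code>bigrams<\/code>; the sample text is made up, and whitespace splitting stands in for a real tokenizer.<\/p>\n

```python
from nltk import FreqDist, bigrams

# A short hand-written text, tokenized naively on whitespace for illustration.
text = "the cat sat on the mat and the cat slept"
tokens = text.split()

# FreqDist counts how often each token occurs.
word_freq = FreqDist(tokens)
print(word_freq.most_common(2))

# Counting adjacent word pairs is a first step toward finding collocations.
pair_freq = FreqDist(bigrams(tokens))
print(pair_freq.most_common(1))
```

<p>For ranked collocations, <code>nltk.collocations.BigramCollocationFinder<\/code> offers association measures that go beyond raw counts.<\/p>\n<p>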
Users can easily compute word-frequency statistics, plot vocabulary distributions, and find words that frequently occur together (collocations), which gives them initial insight into a dataset.<\/p>\n<h3><span class=\"ez-toc-section\" id=\"-Xay-dung-Nguyen-mau-he-thong-NLP\"><\/span><strong>Prototyping NLP Systems<\/strong><span class=\"ez-toc-section-end\"><\/span><\/h3>\n<p>The library lets developers quickly build first versions (<a href=\"https:\/\/interdata.vn\/blog\/prototype-la-gi\/\">prototypes<\/a>) of NLP systems. For example, they can create a simple text classifier, a basic named-entity recognition system, or a rule-based chatbot without spending much time.<\/p>\n<h3><span class=\"ez-toc-section\" id=\"-Phat-trien-cong-cu-ngon-ngu-hoc\"><\/span><strong>Developing Linguistic Tools<\/strong><span class=\"ez-toc-section-end\"><\/span><\/h3>\n<p>Linguists can use NLTK to build their own analysis tools. 
Examples include building a morphological analyzer for a specific language, or developing custom grammars to analyze sentence structure according to a linguistic theory.<\/p>\n<h3><span class=\"ez-toc-section\" id=\"Ho-tro-tac-vu-Hoc-may-co-ban\"><\/span><strong>Support for Basic Machine Learning Tasks<\/strong><span class=\"ez-toc-section-end\"><\/span><\/h3>\n<p>NLTK does more than preprocess text; it also directly supports some basic machine learning tasks. It provides interfaces for extracting features from text (for example, Bag-of-Words or TF-IDF representations can be built on top of its tokenization output) and includes simple classifiers such as Naive Bayes.<\/p>\n<p>In short, NLTK (Natural Language Toolkit) is not just a Python library but also an invaluable platform for learning and research in Natural Language Processing.<\/p>\n<p>With a diverse toolset for basic NLP tasks, access to rich corpora, and an important historical role, NLTK is an ideal starting point for anyone who wants to understand in depth how computers can interpret and process human language.<\/p>\n<p>If you are using NLTK to process language, train NLP models, or run resource-intensive machine learning tasks, choosing the right infrastructure is very important. InterData offers <a href=\"https:\/\/interdata.vn\/hosting-amd\/\">high-speed Hosting<\/a> with AMD EPYC\/Intel Xeon Platinum <a href=\"https:\/\/interdata.vn\/blog\/cpu-server\/\">CPUs<\/a>, NVMe U.2 SSDs, high <a href=\"https:\/\/interdata.vn\/blog\/bang-thong-la-gi\/\">bandwidth<\/a>, and optimized storage capacity, suitable for computational linguistics projects or text analysis systems.<\/p>\n<p>In addition, InterData offers <a href=\"https:\/\/interdata.vn\/thue-vps\/\">quality, low-cost VPS rental<\/a> and <a href=\"https:\/\/interdata.vn\/cloud-server\/\">low-cost Cloud Server rental<\/a> with powerful configurations, stable performance, and easy scaling. 
This is an ideal choice for developers building NLP tools, chatbots, or automated text data analysis systems with Python.<\/p>\n<p><strong>INTERDATA<\/strong><\/p>\n<ul>\n<li><strong><a href=\"https:\/\/interdata.vn\/blog\/website-la-gi\/\">Website<\/a>:<\/strong><span>\u00a0<\/span>Interdata.vn<\/li>\n<li><strong>Hotline:<\/strong><span>\u00a0<\/span>1900-636822<\/li>\n<li><strong>Email:<\/strong><span>\u00a0<\/span>Info@interdata.vn<\/li>\n<li><strong>Representative office:<\/strong><span>\u00a0<\/span>240 Nguy\u1ec5n \u0110\u00ecnh Ch\u00ednh, P.11. Q. Ph\u00fa Nhu\u1eadn, TP. H\u1ed3 Ch\u00ed Minh<\/li>\n<li><strong>Transaction office:<\/strong><span>\u00a0<\/span>S\u1ed1 211 \u0110\u01b0\u1eddng s\u1ed1 5, K\u0110T Lakeview City, P. An Ph\u00fa, TP. Th\u1ee7 \u0110\u1ee9c, TP. H\u1ed3 Ch\u00ed Minh<\/li>\n<\/ul>\n","protected":false},"excerpt":{"rendered":"<p>Are you learning about NLTK (Natural Language Toolkit) and its role in natural language processing (NLP)? 
This article explains in detail what NLTK is and explores the core features of the Natural Language Toolkit as well as the notable applications of the Python library<\/p>\n","protected":false},"author":11,"featured_media":27013,"comment_status":"closed","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[140],"tags":[],"class_list":["post-26993","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-lap-trinh"],"_links":{"self":[{"href":"https:\/\/interdata.vn\/blog\/wp-json\/wp\/v2\/posts\/26993","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/interdata.vn\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/interdata.vn\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/interdata.vn\/blog\/wp-json\/wp\/v2\/users\/11"}],"replies":[{"embeddable":true,"href":"https:\/\/interdata.vn\/blog\/wp-json\/wp\/v2\/comments?post=26993"}],"version-history":[{"count":3,"href":"https:\/\/interdata.vn\/blog\/wp-json\/wp\/v2\/posts\/26993\/revisions"}],"predecessor-version":[{"id":27016,"href":"https:\/\/interdata.vn\/blog\/wp-json\/wp\/v2\/posts\/26993\/revisions\/27016"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/interdata.vn\/blog\/wp-json\/wp\/v2\/media\/27013"}],"wp:attachment":[{"href":"https:\/\/interdata.vn\/blog\/wp-json\/wp\/v2\/media?parent=26993"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/interdata.vn\/blog\/wp-json\/wp\/v2\/categories?post=26993"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/interdata.vn\/blog\/wp-json\/wp\/v2\/tags?post=26993"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}