Document Term Matrix and Vector Space Model as Foundation for Word2Vec, Topic Modeling, IR and NLP
More information about me on https://sirajzade.github.io The link to the code is available on GitHub: https://github.com/sirajzade/learning...

The most important aspect here is the relationship between documents and words. For example, looking at how the words behave or occur in the documents gives us a clue about their characteristics. By the way, this matrix can also be looked at in a transposed way, meaning rows become columns and vice versa. So what we said about documents can also be applied to the words. In our example we see that the words bats, cats and dogs have the same vector, namely 1010, because they all occur in the same documents, and thus they must have similar meanings or belong to a similar semantic field. If we look closer, documents 1 and 3 have similar vectors, and the same is true for the second and fourth document. The reason is that they share many words.

As I mentioned before, vectors, together with scalars, matrices and tensors, are the subject of linear algebra. They are used across many fields of science. They can be represented both numerically and graphically, which I will show later in this video. Because vectors are in fact mathematical objects in a high-dimensional space, what we have seen is also called the Vector Space Model, and the operations of linear algebra can be applied to the Document Term Matrix.

Here is some new terminology. Imagine what happens when the number and the size of the documents grow. Many of our documents will then have lots of 0s in their vectors. In fact, a language can have up to a million words, and the number of text documents can also be very high these days. So we get a very large matrix with little information in it. This is called a "sparse matrix". I know it is sometimes confusing, because sparse suggests small while our matrix is huge; sparse here refers to the information in the matrix and to the fact that it is thinly spread. What we usually do with sparse matrices is make them smaller, compress them if you want, while keeping the same amount of information. This is called "dimensionality reduction", because, if you remember from the beginning of the video, we call the length of a vector its dimension. After we apply dimensionality reduction, the resulting matrix is called a "dense matrix", in contrast to the larger sparse matrix we started with. By the way, dimensionality reduction can be done for the columns as well as for the rows.

One way of doing dimensionality reduction is called Singular Value Decomposition, or SVD. In scikit-learn the class that does it is called TruncatedSVD, and as an argument it takes the number of dimensions you want to reduce to. How exactly SVD works mathematically I will explain in the next videos.

So let us look at the sorted terms in our topics. We see clearly that the most represented words for the first topic are 'and', 'the', dogs, bats, cats. The first two are there because they are present in every document, but the rest makes perfect sense and shows that this topic is mostly about animals. The second topic is even more straightforward: terms like structure, data and algorithm dominate. That means documents where these words are present are more likely to belong to the topic of computer science. By the way, there are many ways of dealing with the function words that are present in all documents. We populated our matrix with absolute counts of words; this is usually called count vectorization, and the corresponding component in the scikit-learn library we use is the CountVectorizer.
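To make the pipeline described above concrete, here is a minimal sketch using scikit-learn's CountVectorizer and TruncatedSVD. The four-document corpus is hypothetical (chosen so that documents 1 and 3 are about animals and documents 2 and 4 about computer science, mirroring the video's example); the actual code in the GitHub repository linked above may differ.

```python
# Minimal sketch: document-term matrix + truncated SVD ("topics"), scikit-learn.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import TruncatedSVD

# Hypothetical toy corpus: docs 1 and 3 are about animals, docs 2 and 4 about CS.
docs = [
    "the dogs and the cats and the bats",
    "the data structure and the algorithm",
    "dogs and cats chase bats",
    "an algorithm uses a data structure",
]

# Build the sparse document-term matrix of absolute word counts.
vectorizer = CountVectorizer()
dtm = vectorizer.fit_transform(docs)      # shape: (n_documents, n_terms)
terms = vectorizer.get_feature_names_out()
print(terms)
print(dtm.toarray())                      # rows = documents, columns = terms

# Dimensionality reduction: compress the term dimension to 2 dense components.
svd = TruncatedSVD(n_components=2)
dense = svd.fit_transform(dtm)            # shape: (n_documents, 2)
print(dense)

# Inspect which terms weigh most on each component ("topic").
for i, component in enumerate(svd.components_):
    top = component.argsort()[::-1][:5]
    print(f"topic {i}:", [terms[j] for j in top])
```

With a corpus like this, one component tends to be dominated by the animal words and the other by the computer science words, which is the behaviour discussed above.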
But one can also use normalized counts, known as TF-IDF, which I will explain in my next videos. Another way is so-called stop word removal, where one simply deletes all the stop words from the text.

Finally, I want to mention that the handy word embeddings everyone in the deep learning community, including me, is using these days leverage the same idea. What the word2vec algorithm does is create dense vectors for words from their context, the so-called window. In the CBOW architecture of the Word2Vec algorithm (CBOW stands for continuous bag of words), the window is 5 words to the left and 5 words to the right of the target word, but you can change this number. That is why word embeddings can capture semantically related words very nicely, and this works much like the Vector Space Model. How exactly the vectors are created in Word2Vec is also interesting, but that should be a topic for another video.

Video cuts and pictures courtesy of pixabay.com, music from bensound.com.
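As a small illustration of the two alternatives mentioned above, here is a sketch of TF-IDF weighting combined with stop word removal in scikit-learn; the toy corpus is the same hypothetical one as in the earlier sketch, not the video's own data.

```python
# Minimal sketch: TF-IDF weighting instead of raw counts, plus stop word removal.
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "the dogs and the cats and the bats",
    "the data structure and the algorithm",
    "dogs and cats chase bats",
    "an algorithm uses a data structure",
]

# TF-IDF down-weights words that occur in every document; stop_words='english'
# removes common function words such as 'the' and 'and' before counting.
tfidf = TfidfVectorizer(stop_words="english")
matrix = tfidf.fit_transform(docs)
print(tfidf.get_feature_names_out())
print(matrix.toarray().round(2))
```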
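And as a rough sketch of the CBOW idea mentioned above, here is how such word vectors could be trained with the gensim library (an assumption on my part; the video does not name a library, and with a corpus this tiny the resulting vectors are not meaningful, they only show the mechanics).

```python
# Minimal sketch: CBOW word embeddings with gensim.
from gensim.models import Word2Vec

sentences = [
    ["dogs", "and", "cats", "chase", "bats"],
    ["an", "algorithm", "uses", "a", "data", "structure"],
    ["cats", "and", "bats", "sleep", "all", "day"],
]

# window=5 means 5 context words on each side of the target word;
# sg=0 selects the CBOW architecture (sg=1 would select skip-gram).
model = Word2Vec(sentences, vector_size=50, window=5, min_count=1, sg=0)

print(model.wv["cats"])               # the dense vector learned for 'cats'
print(model.wv.most_similar("cats"))  # nearest words by cosine similarity
```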