RISS (Research Information Sharing Service)

KCI-listed · SCOPUS

A Large-scale Text Analysis with Word Embeddings and Topic Modeling

      https://www.riss.kr/link?id=A106170922

Additional Information

Multilingual Abstract

This research exemplifies how statistical semantic models and word embedding techniques can play a role in understanding the system of human knowledge. Intuitively, we speculate that when a person is given a piece of text, they first classify its semantic contents, group it with semantically similar texts previously observed, and then relate its contents to that group. We attempt to model this process of knowledge linking by using word embeddings and topic modeling. Specifically, we propose a model that analyzes the semantic/thematic structure of a given corpus, so as to replicate the cognitive process of knowledge ingestion. Our model attempts to make the best of both word embeddings and topic modeling by first clustering documents and then performing topic modeling on the clusters.

To demonstrate our approach, we apply our method to the Corpus of Contemporary American English (COCA). In COCA, texts are divided first by text type and then by subcategory, which represents the specific topics of the documents. To show the effectiveness of our analysis, we focus on texts related to the domain of science. First, we cull science-related texts from various genres, then preprocess them into a usable, appropriate format. In our preprocessing steps, we refine the texts with a combination of tokenization, parsing, and lemmatization; through this preprocessing, we discard words of little semantic value and disambiguate syntactically ambiguous words. Afterwards, using only the nouns from the corpus, we train a word2vec model on the documents and apply K-means clustering to them. The clustering results show that each cluster represents a branch of science, similar to how people relate a new piece of text to semantically related documents. With these results, we proceed to perform topic modeling on each cluster, which reveals the latent topics of each cluster and their relationships with each other.

Through this research, we demonstrate a way to analyze a mass corpus and highlight the semantic/thematic structure of its topics, which can be thought of as a representation of knowledge in human cognition.
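The pipeline the abstract outlines — reduce texts to nouns, vectorize and cluster the documents, then extract topics per cluster — can be sketched end to end. The paper itself uses word2vec embeddings, K-means, and LDA (via gensim and scikit-learn, per its references); the dependency-free sketch below substitutes bag-of-words vectors for the embeddings and per-cluster word frequencies for LDA, so only the shape of the cluster-then-topic-model pipeline is shown. All documents and helper names here are invented for illustration.

```python
# Dependency-free sketch of the cluster-then-topic-model pipeline described in
# the abstract. Bag-of-words vectors stand in for word2vec embeddings and
# per-cluster word frequencies stand in for LDA topics.
from collections import Counter
import math

# Toy "science" documents already reduced to nouns, mirroring the paper's
# preprocessing (tokenization, parsing, lemmatization, nouns only).
docs = [
    ["cell", "protein", "gene", "cell"],
    ["gene", "protein", "enzyme", "cell"],
    ["planet", "orbit", "star", "telescope"],
    ["star", "galaxy", "orbit", "planet"],
]

vocab = sorted({w for d in docs for w in d})

def bow(doc):
    """Bag-of-words vector over the shared vocabulary."""
    counts = Counter(doc)
    return [counts[w] for w in vocab]

def dist(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def kmeans(vectors, k, iters=20):
    """Plain Lloyd's algorithm, seeded on the first k vectors for determinism."""
    centers = [list(v) for v in vectors[:k]]
    labels = [0] * len(vectors)
    for _ in range(iters):
        labels = [min(range(k), key=lambda j: dist(v, centers[j])) for v in vectors]
        for j in range(k):
            members = [v for v, lab in zip(vectors, labels) if lab == j]
            if members:
                centers[j] = [sum(col) / len(members) for col in zip(*members)]
    return labels

labels = kmeans([bow(d) for d in docs], k=2)

def top_words(cluster, n=3):
    """Stand-in for per-cluster topic modeling: most frequent nouns."""
    counts = Counter(w for d, lab in zip(docs, labels) if lab == cluster for w in d)
    return [w for w, _ in counts.most_common(n)]

print(labels)                      # biology docs vs. astronomy docs
print(top_words(0), top_words(1))  # dominant nouns per cluster
```

As in the paper's results, documents about the same branch of science land in the same cluster, and the per-cluster word distributions surface that cluster's theme.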

References

1 홍정하, "Exploring the Topic Structure of a Corpus with Topic Modeling" [토픽모델링을 이용한 코퍼스의 주제구조 탐색] Institute of Language and Information 30: 239-276, 2017

      2 Manning, Christopher D., "The Stanford CoreNLP Natural Language Processing Toolkit" 55-60, 2014

      3 Radim Rehurek, "Software Framework for Topic Modelling with Large Corpora" 2010

      4 Pedregosa, Fabian, "Scikit-learn: Machine Learning in Python" 2011

      5 Bird, Steven, "Natural Language Processing with Python" O’Reilly Media Inc 2009

      6 Blei, David M., "Latent Dirichlet Allocation" 3 : 993-1022, 2003

      7 Sievert, Carson, "LDAvis: A method for visualizing and interpreting topics" 2014

      8 Alex Wang, "GLUE: A Multi-Task Benchmark and Analysis Platform for Natural Language Understanding" Google DeepMind & NYU 2018

      10 Griffiths, Thomas L., "Finding Scientific Topics" PNAS 2004

11 Mikolov, Tomas, "Efficient Estimation of Word Representations in Vector Space" ICLR Workshop 2013

12 "Corpus of Contemporary American English (COCA)"


Citation Information

Journal History

Date        Event                 Detail                                                               KCI status
2024        Evaluation scheduled  Eligible for overseas-DB journal evaluation (overseas-indexed journal evaluation)
2021-01-01  Evaluation            Selected as a listed journal (overseas-indexed journal evaluation)   KCI-listed
2020-12-01  Evaluation            Dropped from candidate status (continued evaluation)
2019-12-01  Evaluation            Demoted to listing candidate (continued evaluation)                  KCI listing candidate
2016-01-01  Evaluation            Listed status maintained (continued evaluation)                      KCI-listed
2012-01-01  Evaluation            First-round listing FAIL (listing maintained)                        KCI-listed
2009-01-01  Evaluation            Selected as a listed journal (second-round candidacy)                KCI-listed
2008-01-01  Evaluation            Candidate first-round PASS (first-round candidacy)                   KCI listing candidate
2006-01-01  Evaluation            Selected as a listing-candidate journal (new evaluation)             KCI listing candidate

Journal Citation Metrics

Base year 2016: WOS-KCI combined IF (2-yr) 0.08 · KCI IF (2-yr) 0.08 · KCI IF (3-yr) 0.08
KCI IF (4-yr) 0.06 · KCI IF (5-yr) 0.06 · Centrality index (3-yr) 0.337 · Immediacy index 0
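For readers unfamiliar with the metrics above, a 2-year impact factor is citations received in a given year to items published in the two preceding years, divided by the number of citable items published in those years. The figures in this sketch are invented for illustration and are not the journal's actual counts.

```python
# Standard 2-year impact-factor arithmetic; the input counts are hypothetical.
def impact_factor_2yr(citations_to_prev_2yrs: int, items_prev_2yrs: int) -> float:
    """Citations in year Y to items from Y-1 and Y-2, per citable item."""
    return citations_to_prev_2yrs / items_prev_2yrs

# e.g. 4 citations in 2016 to a hypothetical 50 articles from 2014-2015:
print(round(impact_factor_2yr(4, 50), 2))  # 0.08
```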
