About/FAQ
Overview
This site demonstrates topic modeling on a collection of 12,000 papers
harvested from the websites of UCSD and UCI faculty. A probabilistic
topic model automatically learns the topics covered in the collection,
with no topics defined in advance.
How were researchers selected?
Researchers were found using a semi-automated crawl of UCSD and UCI
faculty websites. We searched for publications pages that contained
downloadable PDF files, then downloaded those files and converted them
to plain text.
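As a rough illustration of the first step described above (finding
downloadable PDF files on a publications page), here is a minimal
sketch using only the Python standard library. It is not the harvester
used for this site; the class name, function name, sample HTML, and
example URL are all hypothetical, and downloading and text conversion
are left out.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class PdfLinkParser(HTMLParser):
    """Collects absolute URLs of <a> links whose target ends in .pdf."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.pdf_links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        # Ignore any query string and compare case-insensitively.
        if href and href.lower().split("?")[0].endswith(".pdf"):
            self.pdf_links.append(urljoin(self.base_url, href))


def find_pdf_links(html, base_url):
    """Return the PDF links found in one page of HTML."""
    parser = PdfLinkParser(base_url)
    parser.feed(html)
    return parser.pdf_links


# Hypothetical page content and URL, for illustration only.
sample_html = """
<ul>
  <li><a href="papers/kdd04.pdf">Paper A</a></li>
  <li><a href="/pubs/PNAS.PDF">Paper B</a></li>
  <li><a href="talk.ppt">Slides</a></li>
</ul>
"""
links = find_pdf_links(sample_html, "http://www.example.edu/~someone/pubs.html")
print(links)
```

Papers that are not linked as PDF files (the slides link here, for
example) are skipped by a filter like this, which is one way a paper
can be missed by a semi-automated crawl.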
How does the topic model work?
The topic model is based on the
idea that documents are made up of topics, where topics are
probability distributions over words. In essence, the topic model
looks for sets of words that tend to co-occur in documents. All word
order information is discarded before running the topic model.
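The two ideas above can be sketched in a few lines of Python. This is a
toy illustration, not the model used on this site: the vocabulary, the
topic names, and all probabilities are made up, and the sketch shows
only how a document's word probabilities follow from a mixture of
topics, not how topics are learned.

```python
from collections import Counter


def bag_of_words(text):
    """Reduce a document to word counts; word order is discarded."""
    return Counter(text.lower().split())


# Same words, different order: identical as far as the model can tell.
doc_a = "topic model learns topics"
doc_b = "topics learns model topic"
same_bag = bag_of_words(doc_a) == bag_of_words(doc_b)

# Toy topics (assumed values): each is a probability distribution over words.
topics = {
    "genetics": {"gene": 0.5, "dna": 0.4, "data": 0.1},
    "learning": {"model": 0.5, "data": 0.4, "gene": 0.1},
}

# A toy document is a mixture of topics (assumed weights summing to 1).
doc_topic_weights = {"genetics": 0.25, "learning": 0.75}


def word_probability(word, topics, weights):
    """p(word | doc) = sum over topics of p(topic | doc) * p(word | topic)."""
    return sum(weights[t] * topics[t].get(word, 0.0) for t in weights)


p_data = word_probability("data", topics, doc_topic_weights)
print(same_bag, round(p_data, 3))
```

Fitting a topic model runs this reasoning in reverse: given only the
word counts of each document, it estimates the topic distributions and
per-document topic weights that best explain which words co-occur.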
Have you made my papers publicly available?
No.
Why am I not included?
We collected papers using a semi-automated harvester. If your papers
were not available as downloadable files, we were unfortunately unable
to collect them.
I seem to be misrepresented!
The topical characterization for each researcher is based solely on
the papers collected for that researcher. Results for particular
researchers may be noisy because only a limited amount of data was
collected for them.
Who came up with the topic names?
While the topic model
produces the list of most likely words in each topic, it is up to a
human to assign a sensible topic name. Ideally, a domain expert
reviews the list of words and assigns a short name. For this project
we did not have the resources to consult all the appropriate domain
experts for the wide range of topics, so our team sometimes made a
best guess at an appropriate name. If you see a topic that is
mislabeled, please email newman@uci.edu.
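To make the naming step concrete, the sketch below shows how the list
of most likely words might be read off one topic before a person
assigns it a label. The word probabilities and the label are invented
for illustration and do not come from this site's model.

```python
def top_words(topic_word_probs, n=3):
    """Return the n most probable words in a topic, most probable first."""
    ranked = sorted(topic_word_probs.items(), key=lambda kv: kv[1], reverse=True)
    return [word for word, _ in ranked[:n]]


# Hypothetical topic: probabilities the model assigns to each word.
topic_7 = {
    "protein": 0.21,
    "gene": 0.30,
    "the": 0.02,
    "sequence": 0.18,
    "expression": 0.12,
}

words = top_words(topic_7, n=3)
print(words)

# The model stops at the word list; the name is supplied by a person.
human_label = "Genetics"  # assumed label, chosen by a reader of the word list
```

Because the label comes from a reader of the word list rather than from
the model, two reasonable people may name the same topic differently.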
Why are certain people listed under a particular research topic?
The listing is based solely on the harvested papers and does not
necessarily reflect the contribution or activity of faculty in that
research area.
References
- Steyvers, M., Smyth, P., Rosen-Zvi, M., & Griffiths, T. (2004). Probabilistic Author-Topic Models for Information Discovery. The Tenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. Seattle, Washington.
- Griffiths, T., & Steyvers, M. (2004). Finding Scientific Topics. Proceedings of the National Academy of Sciences, 101 (suppl. 1), 5228-5235.
- Newman, D., & Block, S. (2005). Probabilistic Topic Decomposition of an Eighteenth-Century Newspaper. Journal of the American Society for Information Science and Technology.
- Blei, D. M., Ng, A. Y., & Jordan, M. I. (2003). Latent Dirichlet Allocation. Journal of Machine Learning Research, 3, 993-1022.
- Hofmann, T. (1999). Probabilistic Latent Semantic Indexing. Proceedings of the 22nd International Conference on Research and Development in Information Retrieval.