ADDING DOMAIN KNOWLEDGE TO INDUCTIVE LEARNING METHODS FOR CLASSIFYING TEXTS

Kevin D. Ashley
University of Pittsburgh
Learning Research and Development Center
Graduate Program in Intelligent Systems

Contact Information
Kevin D. Ashley
3939 O'Hara Street
Pittsburgh, PA 15213
Phone: (412) 624-7496
Fax: (412) 624-9149
Email: ashley+@pitt.edu
http://www.lrdc.pitt.edu/Ashley/Default.htm

WWW PAGE

http://www.pitt.edu/~steffi/CBR/group.html

List of Supported Students and Staff (optional)

Stefanie Brüninghaus, GRA, University of Pittsburgh Graduate Program in Intelligent Systems

Project Award Information

Keywords

case-based reasoning (CBR), automated case indexing, automated text classification, knowledge-guided machine learning, text-oriented CBR, factor-based text classification, legal information retrieval

Project Summary

The work improves current methods for learning to classify texts by incorporating knowledge from an expert domain model. The goal is to classify the texts of legal opinions automatically in terms of the factors that apply to the cases described. Factors are stereotypical fact patterns that tend to strengthen or weaken the underlying legal claims in a case; together with their relations to legal issues, they are a kind of expert domain knowledge useful in legal argumentation. The program takes as inputs the raw texts of legal opinions and assigns as outputs the applicable factors. The program's training instances are drawn from a corpus of legal opinions whose textual descriptions of cases have been represented manually in terms of factors. The problem is hard because the language of the opinions is complex; the mere fact that an opinion discusses a factor does not necessarily imply that the factor actually applies to the case. Most recently, we have employed ID3 to induce decision trees for classifying by factors. We plan to explore means of supplying certain linguistic information (e.g., about negation) using selective parsing and information extraction techniques.

Publications and Products

Ashley, K.D., and St. Bruninghaus (1998) Developing Mapping and Evaluation Techniques for Textual CBR. In: Proceedings of the AAAI-98 Workshop on Textual Case-Based Reasoning. Pages 20-23. AAAI Technical Report WS-98-12. AAAI Press, Menlo Park, CA.

Bruninghaus, S. and Ashley, K.D. (1999a). "Toward Adding Knowledge to Learning Algorithms for Indexing Legal Cases," In Proceedings, Seventh International Conference on Artificial Intelligence and Law, Association for Computing Machinery, New York. Oslo. June. Donald H. Berman Award for Best Student Paper. http://www.pitt.edu/~steffi/papers/icail99.ps.

Bruninghaus, S. and Ashley, K.D. (1999b). "Bootstrapping Case Base Development with Annotated Case Summaries," In Proceedings of the Third International Conference On Case-Based Reasoning. Munich, Germany. July. Outstanding Research Paper Award. http://www.pitt.edu/~steffi/papers/iccbr99.ps.

Bruninghaus, St., and K.D. Ashley (1998a) Evaluation of Textual CBR Approaches. In: Proceedings of the AAAI-98 Workshop on Textual Case-Based Reasoning. Pages 30-34. AAAI Technical Report WS-98-12. AAAI Press, Menlo Park, CA.

Bruninghaus, St., and K.D. Ashley (1998b) How Machine Learning Can be Beneficial for Textual Case-Based Reasoning. In: Proceedings of the AAAI-98/ICML-98 Workshop on Learning for Text Categorization. Pages 71-74. AAAI Technical Report WS-98-05. AAAI Press, Menlo Park, CA.

Bruninghaus, St., and K.D. Ashley (1997a) Finding Factors: Learning to Classify Case Opinions Under Abstract Fact Categories. In: Proceedings of the Sixth International Conference on Artificial Intelligence and Law (ICAIL-97). Pages 123-131. ACM Press, New York, NY. http://www.pitt.edu/~steffi/papers/icail97.ps

Bruninghaus, St., and K.D. Ashley (1997b) Using Machine Learning to Assign Indices to Legal Cases. In: Case-Based Reasoning Research and Development, Proceedings of the Second International Conference on Case-Based Reasoning (ICCBR-97). Pages 303-314. Lecture Notes in Artificial Intelligence 1266. Springer-Verlag. Heidelberg, Germany. http://www.pitt.edu/~steffi/papers/iccbr97.ps

Ashley, K.D. (1999) Progress in Text-Based Case-Based Reasoning. Invited Talk for the Third International Conference on Case-Based Reasoning. Seeon, Germany. http://www.lrdc.pitt.edu/Ashley/TalkOverheads_files/v3_document.htm

Bruninghaus, St. (1998) Case-Based Reasoning From Textual Documents. Invited Talk at the Sixth German Workshop on Case-Based Reasoning. Extended Abstract published in: Proceedings of the Sixth German Workshop on Case-Based Reasoning (GWCBR-98). Pages 55-58. Berlin, Germany. http://www.pitt.edu/~steffi/papers/slides-gwcbr98.ps

Project Impact

The project has enabled a graduate student in the University of Pittsburgh Graduate Program in Intelligent Systems to pursue her ideal research topic. Stefanie Brüninghaus is performing her Ph.D. dissertation project with this funding and plans to enter academia in AI/Computer Science, a field that needs more female faculty. This funding has already bolstered Ms. Brüninghaus’ professional experience and exposure with an invited talk and two "Best Paper" awards. The work also plays a prominent role as an example in my seminar entitled "Artificial Intelligence and Law," which brings graduate students in the Intelligent Systems Program together with law students. Finally, the work will enable us to improve the intelligent tutoring system CATO, which teaches law students basic skills of legal argument, and to expand its database to include other legal domains.

Goals, Objectives, and Targeted Activities

    Since the start of the original grant, we have engaged in several preliminary experiments to assess the feasibility of applying machine learning to automate index assignment. In our initial experiments, we found that various statistical learning algorithms working with texts represented as vectors of weighted terms would not be appropriate for our task.
    These experiments led us to the idea of learning rules for classifying sentences from manually classified sentences. We decided to use a symbolic learning algorithm, one that learns classification rules, because it is more appropriate for the comparatively small number of training examples available for each factor. We chose to implement ID3, which learns decision trees in which rules are implicit (Quinlan 1993). To reduce complexity, we employed marked-up sentences as training instances rather than full documents. The sentences come from case squibs, brief summaries of the fact situations of all of CATO’s cases, which we had prepared previously for use in CATO’s instruction. Our program SMILE (Smart Index LEarner) employs ID3 to induce decision trees for classifying sentences as positive or negative instances of a factor. ID3 learns a decision tree by recursively partitioning the training set according to the feature that best discriminates positive and negative instances of a factor. The positive training instances are the sentences in the squibs marked up as substantiating the factor. All other sentences in the squibs are considered negative training instances. Each learned decision tree represents a number of rules for assigning a factor to sentences. We are very excited about the decision trees SMILE has learned; intuitively, they confirm that the idea of learning text classification rules from sentences has merit. We have also integrated an application-specific legal thesaurus with the algorithm in order to improve the performance of the decision tree induction.
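    The induction step described above can be illustrated with a minimal sketch. It is not SMILE's code: the training sentences, the thesaurus entries, and every function name are hypothetical, and mapping synonyms to a shared token before induction is only one simple way a thesaurus could be combined with ID3. Sentences are reduced to sets of words, and each split tests whether a single word is present.

```python
import math
from collections import Counter

# Hypothetical thesaurus entries (assumption): each maps a word to a concept token.
THESAURUS = {"locked": "secured", "restricted": "secured", "guarded": "secured"}

def tokenize(sentence, thesaurus=THESAURUS):
    """Turn a sentence into a set of word features, collapsing synonyms."""
    return {thesaurus.get(word, word) for word in sentence.lower().split()}

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())

def information_gain(examples, word):
    """Entropy reduction from splitting on presence or absence of `word`."""
    labels = [label for _, label in examples]
    present = [label for words, label in examples if word in words]
    absent = [label for words, label in examples if word not in words]
    remainder = sum(len(part) / len(examples) * entropy(part)
                    for part in (present, absent) if part)
    return entropy(labels) - remainder

def id3(examples, vocabulary):
    """Recursively partition on the most discriminating word feature."""
    labels = [label for _, label in examples]
    majority = Counter(labels).most_common(1)[0][0]
    if len(set(labels)) == 1 or not vocabulary:
        return majority
    # Sorting makes tie-breaking deterministic (alphabetical).
    best = max(sorted(vocabulary), key=lambda w: information_gain(examples, w))
    if information_gain(examples, best) <= 0:
        return majority
    present = [(words, label) for words, label in examples if best in words]
    absent = [(words, label) for words, label in examples if best not in words]
    rest = vocabulary - {best}
    return {"word": best, "present": id3(present, rest), "absent": id3(absent, rest)}

def classify(tree, words):
    """Follow the tree until a leaf (True or False) is reached."""
    while isinstance(tree, dict):
        tree = tree["present"] if tree["word"] in words else tree["absent"]
    return tree

# Made-up sentences for the factor Security-Measures; True marks a positive instance.
RAW = [
    ("the formula was locked in a safe", True),
    ("access to the plant was restricted", True),
    ("the defendant developed a competing product", False),
    ("the plaintiff disclosed the formula in negotiations", False),
    ("the product was sold to the public", False),
]
TRAINING = [(tokenize(sentence), label) for sentence, label in RAW]
VOCABULARY = set().union(*(words for words, _ in TRAINING))

tree = id3(TRAINING, VOCABULARY)
print(tree)
print(classify(tree, tokenize("the laboratory was guarded")))
```

    In this toy setting the thesaurus lets a single test on the shared token cover sentences that use different words for the same idea, which is the kind of effect one would hope for when training examples are scarce.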
    We compared the ID3 algorithm with two baselines using the F-measure, assigning somewhat more weight to recall than to precision. The two baselines classify a sentence as an instance of a factor if all (or, respectively, any) of the words in the factor name are present in it. Since the factor names are very descriptive (e.g., Disclosure-in-Negotiations, Security-Measures, Info-Reverse-Engineerable), a human expert might very reasonably employ this approach as a first pass in identifying factors in a text. Our learning approach outperformed the baselines for all but one of the six factors tested. Recall and precision, calculated by case, reached as high as 80% for finding which factor applies to a case, an indication that the methodology holds promise (Brüninghaus & Ashley, 1999a, 1999b). Integrating a legal thesaurus significantly improved recall and precision for at least three of the factors. We are examining why it did not have this effect for all factors.
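    The two baselines and the recall-weighted F-measure can be sketched as follows. The sentences and labels are made up for illustration, the function names are hypothetical, and the value of beta is an assumption: the text says only that recall received somewhat more weight than precision, not how much.

```python
# Assumed weighting (not stated in the report): beta > 1 favors recall.
ASSUMED_BETA = 2.0

def factor_words(factor_name):
    """Split a factor name such as 'Security-Measures' into lowercase words."""
    return set(factor_name.lower().split("-"))

def baseline_all(sentence, factor_name):
    """Positive only if every word of the factor name occurs in the sentence."""
    return factor_words(factor_name) <= set(sentence.lower().split())

def baseline_any(sentence, factor_name):
    """Positive if at least one word of the factor name occurs in the sentence."""
    return bool(factor_words(factor_name) & set(sentence.lower().split()))

def f_measure(predicted, actual, beta=ASSUMED_BETA):
    """F_beta = (1 + beta^2) * P * R / (beta^2 * P + R)."""
    tp = sum(1 for p, a in zip(predicted, actual) if p and a)
    fp = sum(1 for p, a in zip(predicted, actual) if p and not a)
    fn = sum(1 for p, a in zip(predicted, actual) if not p and a)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)

# Made-up sentences with made-up gold labels for Security-Measures.
SENTENCES = [
    ("plaintiff adopted security measures to protect the formula", True),
    ("employees passed a security checkpoint", True),
    ("defendant took no measures to verify the design", False),
    ("the product was sold to the public", False),
]
gold = [label for _, label in SENTENCES]
pred_all = [baseline_all(s, "Security-Measures") for s, _ in SENTENCES]
pred_any = [baseline_any(s, "Security-Measures") for s, _ in SENTENCES]

f_all = f_measure(pred_all, gold)
f_any = f_measure(pred_any, gold)
print(round(f_all, 3), round(f_any, 3))
```

    On this toy data the "all words" baseline is precise but misses a positive sentence, while the "any word" baseline finds both positives at the cost of one false alarm; with recall weighted more heavily, the latter scores higher.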
    Assuming renewal of funding, our general plan is to: (1) study the performance of ID3 and other symbolic learning algorithms, (2) investigate whether adding linguistic knowledge using shallow Natural Language Processing or Information Extraction techniques may help process case texts automatically, (3) investigate how to add more domain-dependent knowledge to the learning process and evaluate the resulting program, and (4) explore how well the techniques generalize to a different kind of case text.

Project References (See Publications and Products above)

Area Background

Previously, we developed an expert model of case-based reasoning, which is the basis for an intelligent tutoring system that teaches law students to argue with previous cases available as texts. The texts are legal opinions in which judges record their decisions and rationales for litigated disputes. We have compiled a large corpus of full-text descriptions of cases and a parallel abstract representation of some important aspects of those cases, which captures their content and meaning. Our model of expert legal reasoning relates a set of factors, stereotypical fact patterns which tend to strengthen or weaken a legal claim, to the more abstract legal issues to which the factors are relevant. The evidence that factors apply to a given case is found in passages in the text of the opinions. We constructed these resources in building the CATO program, an NSF PYI-supported intelligent tutoring environment designed to teach law students to make arguments with cases. CATO's Factor Hierarchy relates factors to more aggregated concepts and ultimately to legal issues raised by the legal claim. Together, factors and the Factor Hierarchy enable CATO to generate examples of legal arguments and to provide some feedback on a student's work. We think that, using the representation as guidance, a machine learning program trained on the corpus could learn to classify which factors and issues apply in new cases presented as raw texts.

Area References

Rissland, E. L. and Daniels, J. (1995) "A Hybrid CBR-IR Approach to Legal Information Retrieval." In Proceedings of the Fifth International Conference on AI and Law, (ICAIL-95), pp. 52-61. ACM-Press: New York, NY.

Smith, J.C., Gelbart, D., MacKimmon, K., Atherton, B., McClean, J., Shinehoft, M. and Quintana, L. (1995). "Artificial Intelligence and Legal Discourse: The Flexlaw Legal Text Management System". In Artificial Intelligence and Law, Volume 3, Number 1, pp. 55-95. Kluwer Academic Publishers: Dordrecht, The Netherlands.

Turtle, H. (1995) "Text Retrieval in the Legal World". In Artificial Intelligence and Law. Volume 3, Number 1, pp. 5-54. Kluwer Academic Publishers: Dordrecht, The Netherlands.

Potential Related Projects

To be determined