I seek advice on a classification problem in industry.

The rows in a dataset must be classified during every business cycle; the dataset arrives without a target column, and the labels are hierarchical, with dot-separated levels like 'x.x.x.x.x.x.x'. Each cycle, a new dataset is given as input to this exercise. The dataset includes several variables, most importantly (1) an ID variable and (2) a short text description. When the dataset is correctly classified, each ID corresponds to exactly one class. At every iteration, most ID-to-class links carry over unchanged as identities, but some do not: new labels appear and some existing labels are redefined. In short, there will be unlabeled rows at every iteration.
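To make the setup concrete, here is a minimal sketch of the carry-over step between cycles, using only plain Python. All names (`prev_labels`, `cycle_t1`) and the example rows are illustrative assumptions, not from the question:

```python
# Labels assigned at cycle T0: ID -> hierarchical dot-separated class.
prev_labels = {
    "A01": "1.2.3.4.5.6.7",
    "A02": "1.2.3.4.5.6.8",
}

# Cycle T1's incoming rows: (ID, short text description).
cycle_t1 = [
    ("A01", "bolt, hex head, steel"),   # known ID -> label carries over
    ("A03", "washer, flat, zinc"),      # new ID -> needs classification
]

labelled, unlabelled = [], []
for row_id, text in cycle_t1:
    if row_id in prev_labels:
        labelled.append((row_id, text, prev_labels[row_id]))
    else:
        unlabelled.append((row_id, text))

print(len(labelled), len(unlabelled))  # -> 1 1
```

The rows left in `unlabelled` (new or redefined IDs) are the ones each cycle's classification effort actually concerns.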

The classes are assigned using variable 2, the text description (along with a few others), compared against the content of a, let's say, scoring manual accompanying each cycle's new dataset. (Strictly speaking, the dataset and the manual come from two independent business processes, but we can ignore that here.) In this de facto scoring manual, the classes are unique and each is described exactly once by a handful of short text fields: what identifies a class (=) and what contrasts it with others (!=). So in the manual, which is available as a second dataset, each class appears with only one description.
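The matching step, comparing a row's description against the manual's one description per class, can be sketched as a simple bag-of-words cosine similarity. This is a stdlib-only baseline, not BERT; the `manual` contents are invented for illustration:

```python
import math
from collections import Counter

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two term-count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

# Hypothetical scoring manual: one short description per class.
manual = {
    "1.1": "steel bolt hex head fastener",
    "1.2": "flat zinc washer spacer",
}
manual_vecs = {cls: Counter(desc.split()) for cls, desc in manual.items()}

def classify(description: str) -> str:
    """Assign the class whose manual description is most similar."""
    vec = Counter(description.lower().split())
    return max(manual_vecs, key=lambda cls: cosine(vec, manual_vecs[cls]))

print(classify("hex head steel bolt"))  # -> 1.1
```

A BERT-based variant would replace the `Counter` vectors with sentence embeddings, but the nearest-description logic stays the same.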

The goal is to classify these datasets ad infinitum, as automatically as possible. The dataset has already been classified at time T0 but will need new classes at T1. It is possible to train a supervised learning algorithm on the first batch. The question is whether subsequent batches require expert labeling to train the model or, ideally, whether the 'scoring manual', with one row/observation/example per label, may suffice for fine-tuning. More loosely formulated: how much data is needed for transfer learning? And in the case of training from scratch: how many data points per class are necessary to train a multi-class deep learning model?
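One way the manual's single example per class can be stretched further is self-training: seed a nearest-neighbor classifier from the manual, pseudo-label the rows it is confident about, and grow the reference set so later cycles need fewer expert labels. The following stdlib sketch uses Jaccard token overlap; all names, data, and the 0.5 confidence threshold are illustrative assumptions:

```python
def jaccard(a: set, b: set) -> float:
    """Token-set overlap in [0, 1]."""
    return len(a & b) / len(a | b) if a | b else 0.0

# class -> list of reference token sets; starts with one entry per
# class, taken from the scoring manual's descriptions.
references = {
    "1.1": [{"steel", "bolt", "hex", "head"}],
    "1.2": [{"flat", "zinc", "washer"}],
}

def best_match(tokens: set):
    """Return (score, class) of the closest reference description."""
    scored = [(max(jaccard(tokens, ref) for ref in refs), cls)
              for cls, refs in references.items()]
    return max(scored)

unlabeled = ["hex head bolt in steel", "zinc flat washer", "unknown widget"]
for text in unlabeled:
    tokens = set(text.lower().split())
    score, cls = best_match(tokens)
    if score >= 0.5:  # confident -> pseudo-label and extend the references
        references[cls].append(tokens)
# "unknown widget" stays unlabeled and would go to an expert.
```

The same loop works with a fine-tuned BERT encoder in place of Jaccard; the expert then only reviews the low-confidence remainder each cycle, which is where the "how many observations per class" question actually bites.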

Tags: nlp, how many obs per class are necessary, transfer learning w/ BERT fine-tuning (Stack Overflow)