✨ Rubrix: an open-source framework for data-centric NLP. Data annotation and monitoring for enterprise NLP

recognai · Last update: Oct 03, 2022





What is Rubrix?

Rubrix is a production-ready Python framework for exploring, annotating, and managing data in NLP projects.

Why Rubrix?

  • Open: Rubrix is free, open-source, and 100% compatible with major NLP libraries (Hugging Face transformers, spaCy, Stanford Stanza, Flair, etc.). In fact, you can use and combine your preferred libraries without implementing any specific interface.

  • End-to-end: Most annotation tools treat data collection as a one-off activity at the beginning of each project. In real-world projects, data collection is a key activity of the iterative process of ML model development. Once a model goes into production, you want to monitor and analyze its predictions, and collect more data to improve your model over time. Rubrix is designed to close this gap, enabling you to iterate as much as you need.

  • User and Developer Experience: The key to sustainable NLP solutions is to make it easier for everyone to contribute to projects. Domain experts should feel comfortable interpreting and annotating data. Data scientists should feel free to experiment and iterate. Engineers should feel in control of data pipelines. Rubrix optimizes the experience for these core users to make your teams more productive.

  • Beyond hand-labeling: Classical hand labeling workflows are costly and inefficient, but having humans-in-the-loop is essential. Easily combine hand-labeling with active learning, bulk-labeling, zero-shot models, and weak-supervision in novel data annotation workflows.

Example

Interactive weak supervision. Building a news classifier with user search queries:

(Demo video: ws_news.mp4)

Check the tutorial for more details.

Features

Advanced NLP labeling

  • Programmatic labeling using weak supervision, with built-in label models (Snorkel, FlyingSquid); see the sketch after this list
  • Bulk-labeling and search-driven annotation
  • Iterate on training data with any pre-trained model or library
  • Efficiently review and refine annotations in the UI and with Python
  • Use Rubrix built-in metrics and methods for finding label and data errors (e.g., cleanlab)
  • Simple integration with active learning workflows
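As a rough sketch of what programmatic labeling can look like, the snippet below defines two heuristic rules and combines them with the built-in Snorkel label model. The module path, the example queries, and the dataset name news_zeroshot are assumptions for illustration; check the weak supervision guide in the docs for the exact API:

import rubrix as rb
from rubrix.labeling.text_classification import Rule, Snorkel, WeakLabels

# Heuristic rules: an Elasticsearch query plus the label it assigns
# (queries and labels below are made up for illustration)
rules = [
    Rule(query="stocks OR shares OR economy", label="Business"),
    Rule(query="football OR cup OR coach", label="Sports"),
]

# Apply the rules to every record of the dataset to build a weak-label matrix
weak_labels = WeakLabels(rules=rules, dataset="news_zeroshot")

# Denoise and combine the rule votes with the Snorkel label model
label_model = Snorkel(weak_labels)
label_model.fit()

# Log the resulting noisy predictions back to Rubrix for review
rb.log(label_model.predict(), name="news_programmatic_labels")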

Monitoring

  • Close the gap between production data and data collection activities (see the logging sketch after this list)
  • Auto-monitoring for major NLP libraries and pipelines (spaCy, Hugging Face, FlairNLP)
  • ASGI middleware for HTTP endpoints
  • Rubrix Metrics to understand data and model issues, like entity consistency for NER models
  • Integrated with Kibana for custom dashboards
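A minimal way to close that gap by hand, reusing only the rb.log call from the Get started section, is to log a record for every prediction your service makes; the built-in monitors and the ASGI middleware automate this pattern for the supported libraries. In the sketch below, predict_sentiment and the dataset name are hypothetical placeholders:

import rubrix as rb

def predict_sentiment(text: str) -> dict:
    # Placeholder for your real model call
    return {"label": "positive", "score": 0.87}

def predict_and_monitor(text: str) -> dict:
    output = predict_sentiment(text)

    # Log each prediction into a Rubrix dataset for later review and annotation
    rb.log(
        rb.TextClassificationRecord(
            text=text,
            prediction=[(output["label"], output["score"])],
        ),
        name="sentiment_monitoring",
    )
    return output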

Team workspaces

  • Bring different users and roles into the NLP data and model lifecycles
  • Organize data collection, review and monitoring into different workspaces
  • Manage workspace access for different users (see the sketch after this list)
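From the Python client, you typically point to a server and pick the workspace to work in before logging or loading data. The snippet below is a sketch assuming an rb.init call with api_url, api_key, and workspace parameters (check the Python client API reference for the exact signature); the URL, key, and workspace name are placeholders:

import rubrix as rb

# Connect to a Rubrix server and select a workspace
rb.init(
    api_url="http://localhost:6900",
    api_key="rubrix.apikey",  # placeholder; use your own key
    workspace="my-team",      # placeholder workspace name
)

# Datasets logged from now on are created inside the selected workspace
rb.log(
    rb.TextClassificationRecord(text="A record logged into my-team"),
    name="workspace-example",
)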

Get started

Getting started with Rubrix is as easy as:

pip install "rubrix[server]"

If you don't have Elasticsearch (ES) running, make sure you have Docker installed and run:

docker run -d --name elasticsearch-for-rubrix -p 9200:9200 -p 9300:9300 -e "ES_JAVA_OPTS=-Xms512m -Xmx512m" -e "discovery.type=single-node" docker.elastic.co/elasticsearch/elasticsearch-oss:7.10.2

ℹ️ Check our documentation for further options and configurations regarding Elasticsearch.

Then simply run:

python -m rubrix

Afterward, you should be able to access the web app at http://localhost:6900/. The default username and password are rubrix and 1234.

🚒 If you have problems launching Rubrix, you can get direct support from the maintainers and other community members by joining Rubrix's Slack community.

The following code will log one record into a dataset called example-dataset:

import rubrix as rb

rb.log(
    rb.TextClassificationRecord(text="My first Rubrix example"),
    name='example-dataset'
)

If you go to your Rubrix web app at http://localhost:6900/, you should see your first dataset.

Congratulations! You are ready to start working with Rubrix. You can continue reading for a working example below.

To better understand what's possible, take a look at Rubrix's Cookbook.

🆕 Rubrix Cloud Beta: Use Rubrix on a scalable cloud infrastructure without installing the server. Join the waiting list

Quick links

Doc | Description
🚶 First steps | New to Rubrix and want to get started?
👩‍🏫 Concepts | Want to know more about Rubrix concepts?
🛠️ Setup and install | How to configure and install Rubrix
🗒️ Tasks | What can you use Rubrix for?
📱 Web app reference | How to use the web app for data exploration and annotation
🐍 Python client API | How to use the Python classes and methods
👩‍🍳 Rubrix cookbook | How to use Rubrix with your favourite libraries (flair, stanza...)
👋 Community forum | Ask questions, share feedback, ideas and suggestions
🤗 Hugging Face tutorial | Using Hugging Face transformers with Rubrix for text classification
💫 spaCy tutorial | Using spaCy with Rubrix for NER projects
🐠 Weak supervision tutorial | How to leverage weak supervision with snorkel & Rubrix
🤔 Active learning tutorial | How to use active learning with modAL & Rubrix

Example

Let's see Rubrix in action with a quick example: Bootstrapping data annotation with a zero-shot classifier

Why:

  • The availability of pre-trained language models with zero-shot capabilities means you can sometimes accelerate your data annotation tasks by pre-annotating your corpus with a pre-trained zero-shot model.
  • The same workflow can be applied if there is a pre-trained "supervised" model that fits your categories but needs fine-tuning for your own use case. For example, fine-tuning a sentiment classifier for a very specific type of message.

Ingredients:

  • A zero-shot classifier from the 🤗 Hub: typeform/distilbert-base-uncased-mnli
  • A dataset containing news
  • A set of target categories: Business, Sports, etc.

What are we going to do:

  1. Make predictions and log them into a Rubrix dataset.
  2. Use the Rubrix web app to explore, filter, and annotate some examples.
  3. Load the annotated examples and create a training set, which you can then use to train a supervised classifier.

1. Predict and log

Let's load the zero-shot pipeline and the dataset (we are using the AGNews dataset for demonstration, but this could be your own dataset). Then, let's go over the dataset records and log them using rb.log(). This will create a Rubrix dataset, accessible from the web app.

from transformers import pipeline
from datasets import load_dataset
import rubrix as rb

model = pipeline('zero-shot-classification', model="typeform/distilbert-base-uncased-mnli")

dataset = load_dataset("ag_news", split='test[0:100]')

labels = ['World', 'Sports', 'Business', 'Sci/Tech']

records = []
for item in dataset:
    prediction = model(item['text'], labels)

    records.append(
        rb.TextClassificationRecord(
            text=item["text"],
            prediction=list(zip(prediction['labels'], prediction['scores']))
        )
    )

rb.log(records, name="news_zeroshot")

2. Explore, Filter and Label

Now let's access our Rubrix dataset and start annotating data. Let's filter the records predicted as Sports with high probability and use the bulk-labeling feature to label 5 records as Sports.
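If you prefer to apply the same kind of filter from Python, you can load a subset of the dataset with a search query. The snippet below is a sketch; it assumes rb.load accepts a query string and that predicted_as is the field holding the predicted label (check the Python client API reference and the query reference in the docs):

import rubrix as rb

# Load only the records whose predicted label is "Sports"
sports_records = rb.load(
    name="news_zeroshot",
    query="predicted_as:Sports",
)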

3. Load and create a training set

After a few iterations of data annotation, we can load the Rubrix dataset and create a training set to train or fine-tune a supervised model.

import pandas as pd
import rubrix as rb

# load the Rubrix dataset and put it into a pandas DataFrame
rb_df = rb.load(name='news_zeroshot').to_pandas()

# filter annotated records
rb_df = rb_df[rb_df.status == "Validated"]

# select text input and the annotated label
train_df = pd.DataFrame({
    "text": rb_df.text,
    "label": rb_df.annotation,
})
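As an optional next step (not part of the original example), train_df can feed any supervised pipeline; the sketch below fits a simple scikit-learn baseline on the annotated records:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# A quick TF-IDF + logistic regression baseline trained on the annotations
classifier = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
classifier.fit(train_df.text, train_df.label)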

Architecture

Rubrix's main components are:

  • Rubrix Python client: Python client to log, load, copy and delete Rubrix datasets.
  • Rubrix server: FastAPI REST service for reading and writing data.
  • Elasticsearch: The storage layer and search engine powering the API and the web app.
  • Rubrix web app: Easy-to-use web application for data exploration and annotation.

FAQ

What is Rubrix?

Rubrix is an open-source MLOps tool for building and managing training data for Natural Language Processing.

What can I use Rubrix for?

Rubrix is useful if you want to:

  • create a dataset for training a model.
  • evaluate and improve an existing model.
  • monitor an existing model to improve it over time and gather more training data.

What do I need to start using Rubrix?

You need to have a running instance of Elasticsearch and install the Rubrix Python library.

The library is used to read and write data into Rubrix. To get started, we highly recommend using Jupyter Notebooks, so you might want to install Jupyter Lab or use the Jupyter support in VS Code, for example.

How can I "upload" data into Rubrix?

Currently, the only way to upload data into Rubrix is by using the Python library. This is based on the assumption that there's rarely a perfectly prepared dataset in the format expected by the data annotation tool.

Rubrix is designed to enable fast iteration for users that are closer to data and models, namely data scientists and NLP/ML/data engineers. If you are familiar with libraries like Weights & Biases or MLflow, you'll find Rubrix's log and load methods intuitive. That said, Rubrix gives you different shortcuts and utilities to make loading data into Rubrix a breeze, such as the ability to read datasets directly from the Hugging Face Hub.

In summary, the recommended process for uploading data into Rubrix is the following:

  1. Install the Rubrix Python library.
  2. Open a Jupyter Notebook.
  3. Make sure you have a Rubrix server instance up and running.
  4. Read your source dataset using Pandas, Hugging Face datasets, or any other library.
  5. Do any data preparation, pre-processing, or pre-annotation with a pretrained model.
  6. Transform your dataset rows/records into Rubrix records and log them into a Rubrix dataset using rb.log (see the sketch below). If your dataset is already loaded as a Hugging Face dataset, check the read_datasets method to make this process even simpler.
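The sketch below walks through steps 4 to 6 for a text classification dataset stored in a CSV file; the file name and column names are hypothetical placeholders:

import pandas as pd
import rubrix as rb

# Step 4: read the source dataset (hypothetical file with "text" and "label" columns)
df = pd.read_csv("my_dataset.csv")

# Step 6: turn each row into a Rubrix record and log the whole batch
records = [
    rb.TextClassificationRecord(
        text=row.text,
        annotation=row.label,  # existing labels become annotations
    )
    for row in df.itertuples()
]

rb.log(records, name="my-uploaded-dataset")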

How can I train a model?

The training datasets curated with Rubrix are model agnostic. You can choose one of many amazing frameworks to train your model, like transformers, spaCy, flair, or sklearn. Check out our cookbook and our tutorials on how Rubrix integrates with these frameworks.

If you want to train a Hugging Face transformer, we provide a neat shortcut to prepare your Rubrix dataset for training.
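As a sketch of what that shortcut can look like, assuming it is the prepare_for_training method on a loaded dataset (an assumption; check the docs for the exact name and signature):

import rubrix as rb

# Load the annotated Rubrix dataset and convert it into a datasets.Dataset
# ready for fine-tuning with the Hugging Face Trainer (method name is an assumption)
train_ds = rb.load(name="news_zeroshot").prepare_for_training()

print(train_ds.features)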

Can Rubrix share the Elasticsearch Instance/cluster?

Yes, you can use the same Elasticsearch instance/cluster for Rubrix and other applications. You only need to perform some configuration; check the Advanced installation guide in the docs.

How do I solve an exceeded flood-stage watermark error in Elasticsearch?

By default, Elasticsearch is quite conservative regarding the disk space it is allowed to use. If less than 5% of your disk is free, Elasticsearch enforces a read-only block on every index, and as a consequence Rubrix stops working. To solve this, you can increase the watermark by executing the following command in your terminal:

curl -X PUT "localhost:9200/_cluster/settings?pretty" -H 'Content-Type: application/json' -d'{"persistent": {"cluster.routing.allocation.disk.watermark.flood_stage":"99%"}}'

Community

As a new open-source project, we are eager to hear your thoughts, fix bugs, and help you get started. Feel free to join us on Slack, use the Discussion forum, or open Issues, and we'll be pleased to help out.

Contributors


Subscribe to our newsletter