Saturday, March 21, 2015

Flexible run-time logging configuration in Apache Solr 4.10.x

In a multi-shard setup it is useful to be able to change the log level at run time without visiting each and every shard's admin page.

For example, we can set logging to the WARN level during massive posting sessions and back to INFO when serving user queries.

In Solr 4.10.2 these one-liners do the trick:

# set logging level to WARN,
# saves disk space and speeds up massive posting 
curl -s http://localhost:8983/solr/admin/info/logging \
                       --data-binary "set=root:WARN&wt=json" 
# set logging level to INFO,
# suitable for serving the user queries 
curl -s http://localhost:8983/solr/admin/info/logging \
                       --data-binary "set=root:INFO&wt=json"

Solr responds with a JSON document listing the current status of each configured logger.
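A quick sketch of picking the logger levels out of that response in ruby. Note the response shape here is an assumption (the "loggers" key and field names may differ across Solr versions; check your instance's actual output):

```ruby
require 'json'

# extract logger name/level pairs from the (assumed) response shape
def logger_levels(json)
  JSON.parse(json)["loggers"].map { |l| [l["name"], l["level"]] }
end

# abbreviated sample of a hypothetical response
sample = '{"loggers":[{"name":"root","level":"WARN"},' \
         '{"name":"org.apache.solr.core","level":"INFO"}]}'

logger_levels(sample).each { |name, level| puts "#{name}: #{level}" }
# prints:
# root: WARN
# org.apache.solr.core: INFO
```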

Monday, March 16, 2015

Luke keeps getting updates and now on Apache Pivot

Originally developed for fun and profit by Andrzej Bialecki, the Lucene toolbox Luke continues to be developed. Its releases are published at:

Most recently, Tomoko Uchida has contributed to the effort of porting Luke to Apache Pivot, an Apache License 2.0 friendly GUI framework. A new branch has been created to host this work:

Currently supported Lucene: 4.10.4.

It is far from complete, but already now you can:

  • open your Lucene index and check its metadata

  • page through the documents and analyze fields

  • search the index

We would appreciate it if you could test the Pivot-based Luke and give your feedback.

Monday, November 17, 2014

Lightweight Java Profiler and Interactive svg Flame Graphs

A colleague of mine has just returned from AWS re:Invent and brought in all the excitement about new AWS technologies. So I went on to watch the released videos of the talks. One of the first technical ones I watched was Performance Tuning Amazon EC2 Instances by Brendan Gregg of Netflix. From Brendan's talk I learnt about the Lightweight Java Profiler (LJP) and visualizing stack traces with flame graphs.

I'm quite 'obsessed' with monitoring and performance tuning based on it.
Monitoring your applications is definitely the way to:

1. Get numbers on performance inside your company, spread them, and let people tell stories about them.
2. Tune the system where you see the bottleneck and measure again.

In this post I would like to share a shell script that will produce a colourful and interactive flame graph out of a stack trace of your java application. This may be useful in a variety of ways, from an impressive graph for your slides to informed tuning of your code / system.

Components to build / install

This was run on Ubuntu 12.04 LTS.
Check out the Lightweight Java Profiler project source code and build it:

svn checkout <lightweight-java-profiler-svn-url> \
    lightweight-java-profiler-read-only
cd lightweight-java-profiler-read-only/
make BITS=64 all

(omit the BITS parameter if you want to build for a 32-bit platform).

As a result of a successful compilation you will have an agent library binary that will be used to configure your java process.

Next, clone the FlameGraph github repository:

git clone https://github.com/brendangregg/FlameGraph.git

You don't need to build anything: it is a collection of shell / perl scripts that will do the magic.

Configuring the LJP agent on your java process

The next step is to configure the LJP agent to report stats from your java process. I have picked a Solr instance running under jetty. Here is how I have configured it in my Solr startup script:

java -agentpath:/path/to/lightweight-java-profiler/build-64/liblagent.so \
     -Dsolr.solr.home=cores -jar start.jar

Executing the script should start the Solr instance normally, and the agent will log stack traces to traces.txt.

Generating a Flame graph

In order to produce a flame graph out of the LJP stack trace you will need to perform the following:

1. Convert the LJP stack trace into the collapsed form that FlameGraph understands.

2. Run the flamegraph tool on the collapsed stack trace to produce the svg file.

I have written a shell script that will do this for you.


#!/bin/bash

# change this variable to point to your FlameGraph directory
FLAME_GRAPH_HOME=~/FlameGraph

# usage: ./make_flame_graph.sh traces.txt
LJP_TRACES_FILE=$1
FILENAME=$(basename "$LJP_TRACES_FILE")
COLLAPSED_FILE=$(dirname "$LJP_TRACES_FILE")/${FILENAME%.*}.collapsed
SVG_FILE=$(dirname "$LJP_TRACES_FILE")/${FILENAME%.*}.svg

# collapse the LJP stack trace
"$FLAME_GRAPH_HOME"/stackcollapse-ljp.awk "$LJP_TRACES_FILE" > "$COLLAPSED_FILE"

# create a flame graph
"$FLAME_GRAPH_HOME"/flamegraph.pl "$COLLAPSED_FILE" > "$SVG_FILE"
And here is the flame graph of my Solr instance under the indexing load.

You could interpret this diagram bottom-up: the lowest level is the entry point class that starts the application. Then we see that, CPU-wise, two methods are taking most of the time: org.eclipse.jetty.start.Main.main and

This svg diagram is in fact interactive: load it in a browser and click on the rectangles with the methods you would like to explore further. I have clicked on the
org.apache.solr.update.processor.UpdateRequestProcessor.processAdd rectangle and drilled down into it:

It is this easy to set up a CPU performance check for your java program. Remember to monitor before tuning your code, and wear a helmet.

Friday, November 14, 2014

Ruby pearls and gems for your daily routine coding tasks

This is a list of ruby pearls and gems that help me in my daily routine coding tasks.

1. Retain only unique elements in an array:

a = [1, 1, 2, 3, 4, 4, 5]

a = a.uniq # => [1, 2, 3, 4, 5]
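As a small extra, uniq can also deduplicate by a computed key when given a block (available since Ruby 1.9):

```ruby
# deduplicate case-insensitively by comparing downcased values
words = ["Solr", "solr", "Lucene"]
unique_words = words.uniq { |w| w.downcase }
puts unique_words.inspect # => ["Solr", "Lucene"]
```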

2. Command line options parsing:

require 'optparse'

class Optparser

  def self.parse(args)
    options = {}
    opt_parser = OptionParser.new do |opts|
      opts.banner = "Usage: example.rb [options]"

      opts.on("-v", "--[no-]verbose", "Run verbosely") do |v|
        options[:verbose] = v
      end

      opts.on("-o", "--output OUTPUTDIR", "Output directory") do |o|
        options[:output_dir] = o
      end

      options[:source_dir] = []
      opts.on("-s", "--source SOURCEDIR", "Source directory") do |s|
        options[:source_dir] << s
      end
    end
    opt_parser.parse!(args)
    options
  end
end

options = Optparser.parse(ARGV)
#pp options

When executed with -h, this script will automatically show the options and exit.

3. Delete a key-value pair in the hash map, where the key matches certain condition:

hashMap.delete_if {|key, value| key == "someString" }

Certainly, you can use regular-expression based matching for the condition, or a custom function, say, on the 'key' value.
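For instance, a minimal sketch of the regular-expression variant (the key names here are made up):

```ruby
# drop every pair whose key starts with "tmp_"
h = { "tmp_a" => 1, "tmp_b" => 2, "keep" => 3 }
h.delete_if { |key, _value| key =~ /\Atmp_/ }
puts h.inspect # => {"keep"=>3}
```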

4. Interacting with mysql. I use mysql2 gem. Check out the documentation, it is pretty self-evident.

5. Working with Apache SOLR: rsolr and rsolr-ext are invaluable here:

require 'rsolr'
require 'rsolr-ext'
solrServer = RSolr::Ext.connect :url => $solrServerUrl, :read_timeout => $read_timeout, :open_timeout => $open_timeout

doc = {"field1" => "value1", "field2" => "value2"}

solrServer.add doc

solrServer.commit(:commit_attributes => {:waitSearcher=>false, :softCommit=>false, :expungeDeletes=>true})
solrServer.optimize(:optimize_attributes => {:maxSegments=>1}) # single segment as output

Tuesday, September 23, 2014

Indexing documents in Apache Solr using custom update chain and solrj api

This post focuses on how to target a custom update chain using the solrj api when indexing your documents in Apache Solr. The reason this post exists is that I spent more than an hour figuring this out, which warrants a blog post (hopefully for others' benefit as well).


Suppose that you have a default update chain that is executed in everyday situations, i.e. for the majority of input documents:

<updaterequestprocessorchain default="true" name="everydaychain">
  <processor class="solr.LogUpdateProcessorFactory" />
  <processor class="solr.RunUpdateProcessorFactory" />
</updaterequestprocessorchain>

In some specific cases you would like to execute a slightly modified update chain, in this case with a factory that drops duplicate values from document fields. For that purpose you have configured a custom update chain:

<updaterequestprocessorchain name="customchain">
  <processor class="solr.UniqFieldsUpdateProcessorFactory">
    <lst name="fields">
      <str>field1</str>
    </lst>
  </processor>
  <processor class="solr.LogUpdateProcessorFactory" />
  <processor class="solr.RunUpdateProcessorFactory" />
</updaterequestprocessorchain>

Your update request handler looks like this:

<requesthandler class="solr.UpdateRequestHandler" name="/update">
  <lst name="defaults">
    <str name="update.chain">everydaychain</str>
  </lst>
</requesthandler>

Every time you hit /update from your solrj-backed code, you'll execute document indexing using the "everydaychain".


The task: using solrj, index documents against the custom update chain.


First, before diving into the solution, I'll show the code that you would use for the normal indexing process from java, i.e. with the default everydaychain:

HttpSolrServer httpSolrServer = null;
try {
     httpSolrServer = new HttpSolrServer("http://localhost:8983/solr/core0");
     SolrInputDocument sid = new SolrInputDocument();
     sid.addField("field1", "value1");
     httpSolrServer.add(sid);
     httpSolrServer.commit(); // hard commit; could be soft too
} catch (Exception e) {
     // handle the exception
} finally {
     if (httpSolrServer != null) {
          httpSolrServer.shutdown();
     }
}
So far so good. Next, turning to indexing with the custom update chain. This part is non-obvious from the point of view of the solrj api design: having an instance of SolrInputDocument, how would one target a custom update chain? Notice how the update chain is referenced in the update request handler of your solrconfig.xml: it uses the update.chain parameter. Luckily, this is an http parameter that can be supplied on the /update endpoint. Trying to figure this out via the http client of the httpSolrServer object led nowhere.

It turns out you can use the UpdateRequest class instead. That object has a nice setParam() method that lets you set a value for the update.chain parameter:

HttpSolrServer httpSolrServer = null;
try {
    httpSolrServer = new HttpSolrServer(updateURL);

    SolrInputDocument sid = new SolrInputDocument();
    // dummy field
    sid.addField("field1", "value1");

    UpdateRequest updateRequest = new UpdateRequest();
    updateRequest.setParam("update.chain", "customchain");
    updateRequest.add(sid);

    UpdateResponse updateResponse = updateRequest.process(httpSolrServer);
    // note: a successful update request reports status 0
    if (updateResponse.getStatus() == 0) {
        log.info("Successfully added a document");
    } else {
        log.info("Adding document failed, status code=" + updateResponse.getStatus());
    }
    httpSolrServer.commit();
} catch (Exception e) {
    // handle the exception
} finally {
    if (httpSolrServer != null) {
        httpSolrServer.shutdown();
        log.info("Released connection to the Solr server");
    }
}

Executing the second code snippet will trigger the LogUpdateProcessor to output the following line in the solr logs:

org.apache.solr.update.processor.LogUpdateProcessor  –
   [core0] webapp=/solr path=/update params={wt=javabin&
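A side note: since update.chain is an ordinary HTTP parameter, the custom chain can also be targeted without solrj at all. A minimal ruby sketch (the URL, core name and document are illustrative; the actual request is left commented out):

```ruby
require 'net/http'
require 'uri'

uri = URI("http://localhost:8983/solr/core0/update")
uri.query = URI.encode_www_form("update.chain" => "customchain", "commit" => "true")

request = Net::HTTP::Post.new(uri)
request["Content-Type"] = "application/xml"
request.body = '<add><doc><field name="field1">value1</field></doc></add>'

# response = Net::HTTP.start(uri.host, uri.port) { |http| http.request(request) }
puts uri # the chain is selected purely via the query string
```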

That's it for today. Happy indexing!

Wednesday, September 17, 2014

Exporting Lucene index to xml with Luke

Luke is the open source Lucene toolbox originally written by Andrzej Bialecki and currently maintained by yours truly. The tool allows you to introspect your solr / lucene index, check it for health, fix problems, verify field tokens and even experiment with scoring or read the index from HDFS.

In this post I would like to illustrate one particular luke feature that allows you to dump the index into an xml file for external processing.


The task: extract the indexed tokens from a field to a file for further analysis outside luke.


Indexing data

In order to extract tokens you need to index your field with term vectors configured. Usually this also means that you need to configure positions and offsets.

If you are indexing using Apache Solr, you would configure the following on your field:

<field indexed="true" name="Contents" omitNorms="false" stored="true" termOffsets="true" termPositions="true" termVectors="true" type="text" />

With this line you make sure your field is going to store its contents, not only index them; it will also store the term vectors, i.e. each term with its positions and offsets in the token stream.


Extracting index terms

One way to view the indexed tokens with luke is to search / list documents, select the field with term vectors enabled and click the TV button (or right-click and choose "Field's Term Vector").

If you would like to extract this data into an external file, there is currently a way to accomplish this via the menu Tools -> Export index to XML:

In this case I have selected docid 94724 (note that this is lucene's internal doc id, not the solr application-level document id!), which is visible when viewing a particular document in luke. This dumps the document into an xml file, including the fields in the schema and each field's contents. In particular, it will dump the term vectors (if present) of a field, in my case:

<field flags="Idfp--SV-Nnum--------" name="Contents">
<val>CENTURY TEXT.</val>
<t freq="1" offsets="0-7" positions="0" text="centuri" />
<t freq="1" offsets="0-7" positions="0" text="centuryä" />
<t freq="1" offsets="8-12" positions="1" text="text" />
<t freq="1" offsets="8-12" positions="1" text="textä" />
</field>
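Once dumped, the tokens are easy to post-process outside luke. A small ruby sketch that pulls the term texts out of a fragment like the one above (shown here on a two-token subset of the dump):

```ruby
require 'rexml/document'

xml = <<-XML
<field flags="Idfp--SV-Nnum--------" name="Contents">
<val>CENTURY TEXT.</val>
<t freq="1" offsets="0-7" positions="0" text="centuri" />
<t freq="1" offsets="8-12" positions="1" text="text" />
</field>
XML

doc = REXML::Document.new(xml)
# collect the text attribute of every <t> (term) element
terms = REXML::XPath.match(doc, "//t").map { |t| t.attributes["text"] }
puts terms.inspect # => ["centuri", "text"]
```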

Monday, June 9, 2014

Low-level testing your Lucene TokenFilters

At the recent Berlin Buzzwords conference, in his talk on Apache Lucene 4, Robert Muir mentioned Lucene's internal testing library. This library is essentially a collection of classes and methods that form the test bed for Lucene committers. But, as a matter of fact, the same library can be perfectly well used in your own code. Dawid Weiss has talked about randomized testing with Lucene, which is not the focus of this post but is really a great way of running your usual static tests with randomization.

This post will show a few code snippets that illustrate the usage of the Lucene test library for verifying the consistency of your custom TokenFilters on a lower level than you might be used to.

I'm putting up this fancy term graph to prove that posts with images are opened more often than those without. Ok, it has relevant parts too: in particular, we are looking into creating our own TokenFilter in parallel to StopFilter, LowerCaseFilter, StandardFilter and PorterStemFilter.

In the naming-convention spirit of the previous post, where custom classes started with the GroundShaking prefix, let's create our own MindBlowingTokenFilter class. For the sake of illustration, our token filter will take each term from the term stream, add a "mindblowing" suffix to it and store it in the stream as a new term. This class will be the basis for writing unit tests.

package com.dmitrykan.blogspot;

import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.tokenattributes.PositionIncrementAttribute;
import org.apache.lucene.analysis.tokenattributes.PositionLengthAttribute;

import java.io.IOException;

/**
 * Created by dmitry on 6/9/14.
 */
public final class MindBlowingTokenFilter extends TokenFilter {

    private final CharTermAttribute termAtt;
    private final PositionIncrementAttribute posAtt;
    // dummy thing, is needed for complying with BaseTokenStreamTestCase assertions
    private PositionLengthAttribute posLenAtt; // don't remove this, otherwise the low-level test will fail

    private State save;

    public static final String MIND_BLOWING_SUFFIX = "mindblowing";

    /**
     * Construct a token stream filtering the given input.
     * @param input the input token stream
     */
    protected MindBlowingTokenFilter(TokenStream input) {
        super(input);
        this.termAtt = addAttribute(CharTermAttribute.class);
        this.posAtt = addAttribute(PositionIncrementAttribute.class);
        this.posLenAtt = addAttribute(PositionLengthAttribute.class);
    }

    @Override
    public boolean incrementToken() throws IOException {
        if (save != null) {
            // emit the original, unsuffixed term at the same position
            restoreState(save);
            posAtt.setPositionIncrement(0);
            save = null;
            return true;
        }

        if (input.incrementToken()) {
            // pass through zero-length terms
            int oldLen = termAtt.length();
            if (oldLen == 0) return true;

            // save the original state, so that the unmodified term
            // can be emitted on the next call
            save = captureState();

            // append the suffix to the current term
            char[] buffer = termAtt.resizeBuffer(oldLen + MIND_BLOWING_SUFFIX.length());
            for (int i = 0; i < MIND_BLOWING_SUFFIX.length(); i++) {
                buffer[oldLen + i] = MIND_BLOWING_SUFFIX.charAt(i);
            }
            termAtt.copyBuffer(buffer, 0, oldLen + MIND_BLOWING_SUFFIX.length());
            return true;
        }
        return false;
    }
}

The next thing we would like to do is write a Lucene-level test suite for this class. We will extend it from BaseTokenStreamTestCase, not the standard TestCase or another class from a testing framework you might be used to dealing with. The reason is that we'd like to utilize Lucene's internal test functionality, which lets you access and cross-check the lower-level items, like term position increments, position lengths, position start and end offsets etc.

Roughly the same information can be seen on Apache Solr's analysis page, if you enable the verbose mode. While the analysis page is good for visually debugging your code, the unit test is meant to run every time you change and build your code. If you decide to first visually examine the term positions, start and end offsets with Solr, you'll need to wrap the token filter into a factory and register it in the schema on your field type. The factory code:

package com.dmitrykan.blogspot;

import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.util.TokenFilterFactory;

import java.util.Map;

/**
 * Created by dmitry on 6/9/14.
 */
public class MindBlowingTokenFilterFactory extends TokenFilterFactory {
    public MindBlowingTokenFilterFactory(Map<String, String> args) {
        super(args);
    }

    @Override
    public MindBlowingTokenFilter create(TokenStream input) {
        return new MindBlowingTokenFilter(input);
    }
}

Here is the test class in all its glory.

package com.dmitrykan.blogspot;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.BaseTokenStreamTestCase;
import org.apache.lucene.analysis.MockTokenizer;
import org.apache.lucene.analysis.Tokenizer;

import java.io.IOException;
import java.io.Reader;

/**
 * Created by dmitry on 6/9/14.
 */
public class TestMindBlowingTokenFilter extends BaseTokenStreamTestCase {
    private Analyzer analyzer = new Analyzer() {
        @Override
        protected TokenStreamComponents createComponents(String fieldName, Reader reader) {
            Tokenizer source = new MockTokenizer(reader, MockTokenizer.WHITESPACE, true);
            return new TokenStreamComponents(source, new MindBlowingTokenFilter(source));
        }
    };

    public void testPositionIncrementsSingleTerm() throws IOException {
        String output[] = {"queries" + MindBlowingTokenFilter.MIND_BLOWING_SUFFIX, "queries"};
        // the position increment of the first term must be 1 and of the second must be 0,
        // because the second term is stored in the same position in the token filter stream
        int posIncrements[] = {1, 0};
        // this is dummy stuff, but the test does not run without it
        int posLengths[] = {1, 1};

        assertAnalyzesToPositions(analyzer, "queries", output, posIncrements, posLengths);
    }

    public void testPositionIncrementsTwoTerms() throws IOException {
        String output[] = {"your" + MindBlowingTokenFilter.MIND_BLOWING_SUFFIX, "your", "queries" + MindBlowingTokenFilter.MIND_BLOWING_SUFFIX, "queries"};
        // the position increment of the first term must be 1 and of the second must be 0,
        // because the second term is stored in the same position in the token filter stream
        int posIncrements[] = {1, 0, 1, 0};
        // this is dummy stuff, but the test does not run without it
        int posLengths[] = {1, 1, 1, 1};

        assertAnalyzesToPositions(analyzer, "your queries", output, posIncrements, posLengths);
    }

    public void testPositionIncrementsFourTerms() throws IOException {
        String output[] = {
                "your" + MindBlowingTokenFilter.MIND_BLOWING_SUFFIX, "your",
                "queries" + MindBlowingTokenFilter.MIND_BLOWING_SUFFIX, "queries",
                "are" + MindBlowingTokenFilter.MIND_BLOWING_SUFFIX, "are",
                "fast" + MindBlowingTokenFilter.MIND_BLOWING_SUFFIX, "fast"};
        // position increments follow the 1-0 pattern, because for each next term we insert
        // a new term into the same position (i.e. position increment is 0)
        int posIncrements[] = {
                1, 0,
                1, 0,
                1, 0,
                1, 0};
        // this is dummy stuff, but the test does not run without it
        int posLengths[] = {
                1, 1,
                1, 1,
                1, 1,
                1, 1};

        assertAnalyzesToPositions(analyzer, "your queries are fast", output, posIncrements, posLengths);
    }

    public void testPositionOffsetsFourTerms() throws IOException {
        String output[] = {
                "your" + MindBlowingTokenFilter.MIND_BLOWING_SUFFIX, "your",
                "queries" + MindBlowingTokenFilter.MIND_BLOWING_SUFFIX, "queries",
                "are" + MindBlowingTokenFilter.MIND_BLOWING_SUFFIX, "are",
                "fast" + MindBlowingTokenFilter.MIND_BLOWING_SUFFIX, "fast"};
        // both terms in a pair share the start and end offsets of the original term
        int startOffsets[] = {
                0, 0,
                5, 5,
                13, 13,
                17, 17};
        int endOffsets[] = {
                4, 4,
                12, 12,
                16, 16,
                21, 21};

        assertAnalyzesTo(analyzer, "your queries are fast", output, startOffsets, endOffsets);
    }
}

All tests should pass, and yes, the same numbers are present on Solr's analysis page:

MindBlowingTokenFilter solr analysis page

Happy unit testing with Lucene!

yours, @dmitrykan