Workflow

In this section, we walk step by step through an example of using VSS to process a multimedia data stream. To do this, we will put ourselves in the shoes of someone using VSS for the first time.

Suppose that, for our use case, we have an audio stream with the following characteristics:

  • Format: RAW (no specific format)
  • Codec: PCMU
  • Sample rate: 8000 Hz
  • Number of channels: 1

We want to process this stream with a VAD (Voice Activity Detection) algorithm so that only the parts of the stream that contain speech are retained.

Note

This specific case is a very typical scenario that often arises in contact center environments, where some form of biometric solution is to be applied to the caller's voice. However, the concepts applied here are transferable to any scenario in which some type of processing is to be performed using VSS.

1. Discover pipelines

The first step is to discover which pipelines are available and whether any of them fit the proposed scenario.

curl -v --location --request GET 'https://<VSS_HOST>/api/v1/pipelines' --header 'apikey: *****'

["urn:vss:pipelines:vad"]
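For readers who prefer a script to curl, the same request can be sketched in Python with the standard library only. The helper names build_list_pipelines_request and parse_pipeline_list, as well as the host vss.example.com, are hypothetical and used only for illustration. The endpoint path and the apikey header are taken from the curl command above.

```python
import json
import urllib.request


def build_list_pipelines_request(host: str, api_key: str) -> urllib.request.Request:
    """Build (but do not send) the GET request that lists the available pipelines."""
    url = f"https://{host}/api/v1/pipelines"
    return urllib.request.Request(url, headers={"apikey": api_key}, method="GET")


def parse_pipeline_list(body: str) -> list:
    """Decode the JSON array of pipeline URNs shown in the example response."""
    urns = json.loads(body)
    if not isinstance(urns, list) or not all(isinstance(u, str) for u in urns):
        raise ValueError("expected a JSON array of URN strings")
    return urns


# Sending the request would look like this (not executed here):
# request = build_list_pipelines_request("vss.example.com", "<your api key>")
# with urllib.request.urlopen(request) as response:
#     print(parse_pipeline_list(response.read().decode("utf-8")))
```

Keeping request construction separate from sending makes it possible to inspect the URL and headers before any network call is made.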

Warning

The available pipelines are created by Veridas based on the use cases and needs of its clients. If no pipeline suits a specific use case, the Veridas team must provide one.

2. Get pipeline details

So there's only one pipeline available that could fit the scenario. Let's take a detailed look at it.

curl -v --location --request GET 'https://<VSS_HOST>/api/v1/pipelines/urn:vss:pipelines:vad' --header 'apikey: *****'

{
    "urn": "urn:vss:pipelines:vad:v1",
    "description": "This pipeline takes an input audio stream and, using a VAD (Voice Activity Detection) algorithm, retains only the parts of the stream that contain speech.",
    "schema": {
        "type": "object",
        "properties": {
            "net_speech_duration": {
                "type": "integer",
                "minimum": 1,
                "maximum": 30
            },
            "media_type": {
                "type": "string",
                "enum": ["audio/x-mulaw", "audio/x-alaw", "audio/x-raw"]
            },
            "sample_rate": {
                "type": "integer",
                "enum": [8000, 16000]
            },
            "sample_width": {
                "type": "integer",
                "enum": [8, 16, 32]
            }
        },
        "if": {
            "properties": { "media_type": { "const": "audio/x-raw" } }
        },
        "then": {
            "required": ["net_speech_duration", "media_type", "sample_rate", "sample_width"]
        },
        "else": {
            "required": ["net_speech_duration", "media_type", "sample_rate"]
        },
        "additionalProperties": false
    },
    "created_at": "2025-05-02T08:31:29.943808Z"
}

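Judging by its description, the pipeline with the URN urn:vss:pipelines:vad does serve the intended purpose. Its schema lets you specify the required seconds of speech through the net_speech_duration parameter (an integer between 1 and 30), together with the media_type and sample_rate of the stream. The sample_width parameter is additionally required when media_type is audio/x-raw.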
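Before creating a session, it can help to check a candidate configuration locally against the rules in the schema above. The following is a minimal hand-written sketch of those rules, not a general JSON Schema validator and not part of VSS. The function name check_vad_pipeline_data is hypothetical. It assumes, as recent JSON Schema drafts do, that a value such as 5.0 counts as an integer.

```python
ALLOWED_MEDIA_TYPES = ("audio/x-mulaw", "audio/x-alaw", "audio/x-raw")
ALLOWED_SAMPLE_RATES = (8000, 16000)
ALLOWED_SAMPLE_WIDTHS = (8, 16, 32)
KNOWN_KEYS = ("net_speech_duration", "media_type", "sample_rate", "sample_width")


def _is_integer(value) -> bool:
    # Assumption: 5.0 is accepted as an integer; booleans are not.
    if isinstance(value, bool):
        return False
    return isinstance(value, int) or (isinstance(value, float) and value.is_integer())


def check_vad_pipeline_data(data: dict) -> list:
    """Return a list of human-readable problems; an empty list means no problem found."""
    errors = []

    # "additionalProperties": false
    extra = sorted(set(data) - set(KNOWN_KEYS))
    if extra:
        errors.append(f"unexpected properties: {extra}")

    # if/then/else: the "if" branch also holds when media_type is absent.
    required = ["net_speech_duration", "media_type", "sample_rate"]
    if "media_type" not in data or data["media_type"] == "audio/x-raw":
        required.append("sample_width")
    for key in required:
        if key not in data:
            errors.append(f"missing required property: {key}")

    # Per-property type, range and enum checks.
    if "net_speech_duration" in data:
        value = data["net_speech_duration"]
        if not (_is_integer(value) and 1 <= value <= 30):
            errors.append("net_speech_duration must be an integer between 1 and 30")
    if "media_type" in data and data["media_type"] not in ALLOWED_MEDIA_TYPES:
        errors.append("media_type is not one of the allowed values")
    if "sample_rate" in data:
        value = data["sample_rate"]
        if not (_is_integer(value) and value in ALLOWED_SAMPLE_RATES):
            errors.append("sample_rate must be 8000 or 16000")
    if "sample_width" in data:
        value = data["sample_width"]
        if not (_is_integer(value) and value in ALLOWED_SAMPLE_WIDTHS):
            errors.append("sample_width must be 8, 16 or 32")
    return errors
```

For the stream in this example, check_vad_pipeline_data({"net_speech_duration": 5, "media_type": "audio/x-mulaw", "sample_rate": 8000}) returns an empty list. The server-side validation remains the authority; a local check like this only catches obvious mistakes earlier.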

3. Create processing session

Once the pipeline to be used is known, the next step is to create a processing session. A session is created by specifying a stream_id, the pipeline_urn to be used and its configuration (via the pipeline_data parameter). For this example, let's suppose we need the first 5 seconds of voice.

curl -v --location --request POST 'https://<VSS_HOST>/api/v1/sessions' --header 'apikey: *****' --header 'Content-Type: application/json' --data '{
    "stream_id": "2a604420-d08b-416e-8c24-9cb217b342fc",
    "pipeline_urn": "urn:vss:pipelines:vad",
    "pipeline_data": {
        "net_speech_duration": 5.0,
        "media_type": "audio/x-mulaw",
        "sample_rate": 8000
    }
}'

{
    "stream_id": "2a604420-d08b-416e-8c24-9cb217b342fc",
    "pipeline_urn": "urn:vss:pipelines:vad",
    "pipeline_data": {
        "net_speech_duration": 5.0,
        "media_type": "audio/x-mulaw",
        "sample_rate": 8000
    },
    "expired_at": "2025-05-31T11:26:18.486407Z",
    "updated_at": "2025-05-30T11:26:18.486407Z",
    "created_at": "2025-05-30T11:26:18.486388Z",
    "status": "waiting_for_stream",
    "stream_processing": {
        "waiting_timeout": "PT5M"
    }
}

Warning

It's important here that pipeline_data complies with the schema of the pipeline; otherwise, a request validation error will occur.

Note

Note that stream_id is a new random UUIDv4 identifier.
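As a small illustration, the request body for the session can be assembled in Python, generating the stream_id with the standard uuid module. The helper name build_session_body is hypothetical. The field names come from the request shown above.

```python
import json
import uuid


def build_session_body(pipeline_urn: str, pipeline_data: dict) -> dict:
    """Assemble the JSON body for session creation with a fresh UUIDv4 stream_id."""
    return {
        "stream_id": str(uuid.uuid4()),  # new random identifier for this stream
        "pipeline_urn": pipeline_urn,
        "pipeline_data": pipeline_data,
    }


body = build_session_body(
    "urn:vss:pipelines:vad",
    {"net_speech_duration": 5, "media_type": "audio/x-mulaw", "sample_rate": 8000},
)
print(json.dumps(body, indent=4))
```

The generated stream_id has to be kept, because the same value is needed later to associate the stream and to query the session.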

It can be verified that the session's initial state is waiting_for_stream, since it doesn't have any stream associated yet. You can also see in the stream_processing parameter that the session's waiting_timeout is 5 minutes, which is the maximum time it will wait for a stream before throwing a timeout error. This is a safety measure in VSS to prevent sessions from waiting indefinitely.
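The value PT5M has the shape of an ISO 8601 duration, which is consistent with the 5 minutes mentioned above. Assuming that is the format VSS uses, a client could convert it to a timedelta as sketched below. The function parse_duration is hypothetical and only handles days, hours, minutes and seconds, which is enough for values of this kind.

```python
import re
from datetime import timedelta

# Matches durations such as "PT5M", "PT1H30M", "P1DT2H" or "PT0.5S".
_DURATION_RE = re.compile(
    r"^P(?:(\d+)D)?(?:T(?:(\d+)H)?(?:(\d+)M)?(?:(\d+(?:\.\d+)?)S)?)?$"
)


def parse_duration(value: str) -> timedelta:
    """Convert a simple ISO 8601 duration string into a timedelta."""
    match = _DURATION_RE.match(value)
    # Reject strings that do not match, and bare "P" / "PT" with no components.
    if match is None or not any(match.groups()):
        raise ValueError(f"unsupported duration: {value!r}")
    days, hours, minutes, seconds = match.groups()
    return timedelta(
        days=int(days or 0),
        hours=int(hours or 0),
        minutes=int(minutes or 0),
        seconds=float(seconds or 0),
    )


print(parse_duration("PT5M").total_seconds())
```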

4. Associate a stream

Now, the only thing left to do is to create a stream associated with the session so that processing can take place. Here, you can create a stream of any of the types supported by VSS. In this example, we will use a file stream, specifying a stream_id that must be the same as the one specified for the session so that the two can be matched.

curl --location 'https://<VSS_HOST>/api/v1/streams/file' --header 'apikey: *****' --form 'stream_id="2a604420-d08b-416e-8c24-9cb217b342fc"' --form 'file=@"/files/test_audio_mulaw__1_8000.raw"'

If no error is returned (202 Accepted), the stream has been successfully created and associated with the session that uses the same stream_id. From this point on, processing begins.

Note

In a real contact center scenario, it is common to encounter real-time audio sent through some real-time protocol (WebSocket, RTP, ...). However, for simplicity and better understanding, we use a file stream here. Nevertheless, the workflow for a real-time stream would be exactly the same as the one presented here.

Warning

It's important here that the stream format matches the one specified in the pipeline_data parameter of the session. Otherwise, unexpected results can occur during processing.
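One inexpensive sanity check before uploading a headerless file is to compare its size with the duration you expect. The sketch below relies on the fact that mu-law and A-law (G.711) use 8 bits per sample. It assumes that sample_width is expressed in bits, that the file has no header, and that the audio is mono unless stated otherwise. The function name expected_duration_seconds is hypothetical.

```python
from typing import Optional


def expected_duration_seconds(
    num_bytes: int,
    media_type: str,
    sample_rate: int,
    sample_width: Optional[int] = None,
    channels: int = 1,
) -> float:
    """Estimate the duration of headerless audio from its size in bytes."""
    if media_type in ("audio/x-mulaw", "audio/x-alaw"):
        bits_per_sample = 8  # G.711 companded audio: one byte per sample
    elif media_type == "audio/x-raw":
        if sample_width is None:
            raise ValueError("sample_width is required for audio/x-raw")
        bits_per_sample = sample_width  # assumption: sample_width is in bits
    else:
        raise ValueError(f"unknown media type: {media_type!r}")
    bytes_per_second = sample_rate * channels * bits_per_sample // 8
    return num_bytes / bytes_per_second


# A 72000-byte mu-law file at 8000 Hz mono corresponds to 9 seconds of audio.
print(expected_duration_seconds(72000, "audio/x-mulaw", 8000))
```

If the estimate is far from the real length of the recording, the declared media_type, sample_rate or sample_width probably does not describe the file.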

5. Check session status

The next step is to check the progress of the processing session.

curl -v --location --request GET 'https://<VSS_HOST>/api/v1/sessions/2a604420-d08b-416e-8c24-9cb217b342fc' --header 'apikey: *****' --header 'Content-Type: application/json'

{
    "stream_id": "2a604420-d08b-416e-8c24-9cb217b342fc",
    "pipeline_urn": "urn:vss:pipelines:vad:v1",
    "pipeline_data": {
        "net_speech_duration": 5.0,
        "media_type": "audio/x-mulaw",
        "sample_rate": 8000
    },
    "expired_at": "2025-05-31T11:28:32.182546Z",
    "updated_at": "2025-05-30T11:28:32.182546Z",
    "created_at": "2025-05-30T11:26:18.486388Z",
    "status": "finished",
    "stream_processing": {...}
}

Depending on the session status, the stream_processing parameter of the response contains different information.

Waiting for stream

waiting_for_stream is the initial state—the same one returned when the session is first created.

{
    "status": "waiting_for_stream",
    "stream_processing": {
        "waiting_timeout": "PT5M"
    }
}
  • waiting_timeout: maximum time the session will wait for a stream before throwing a timeout error.

Processing stream

The processing_stream state means that a stream has been matched to the session and processing has started but has not yet finished.

{
    "status": "processing_stream",
    "stream_processing": {
        "started_at": "2025-05-30T11:30:13.137Z",
        "audio_0": {
            "input_audio_duration": 2,
            "net_speech_duration": 1
        }
    }
}
  • started_at: datetime when session processing started.
  • audio_0: the name of the resulting processed elementary stream.
      - input_audio_duration: the duration (in seconds) of the input audio stream.
      - net_speech_duration: the duration (in seconds) of the resulting audio stream after applying the VAD algorithm.

Finished

The finished state means that processing has completed successfully.

{
    "status": "finished",
    "stream_processing": {
        "started_at": "2025-05-30T11:30:13.137Z",
        "audio_0": {
            "input_audio_duration": 9,
            "net_speech_duration": 5,
            "files": [
                "urn:vss:files:2a604420-d08b-416e-8c24-9cb217b342fc:audio_0:input_audio",
                "urn:vss:files:2a604420-d08b-416e-8c24-9cb217b342fc:audio_0:net_speech"
            ]
        },
        "finished_at": "2025-05-30T11:30:21.811Z"
    }
}
  • finished_at: datetime when session processing finished.
  • audio_0:
      - files: list of the URNs of the files produced as a result of the processing.

Cancelled

The cancelled state means that the stream ended before processing was completed.

{
    "status": "cancelled",
    "stream_processing": {
        "started_at": "2025-05-30T11:30:13.137Z",
        "audio_0": {
            "input_audio_duration": 2,
            "net_speech_duration": 1
        },
        "cancelled_at": "2025-05-30T11:30:17.739Z"
    }
}
  • cancelled_at: datetime when session processing was cancelled.

Error

The error state means that an error occurred during processing.

{
    "status": "error",
    "stream_processing": {
        "reason": "Internal Server Error"
    }
}
  • reason: the reason for the error.
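Since the session moves through these states over time, a client will typically poll until it reaches a state that no longer changes. The sketch below treats finished, cancelled and error as final, which is an assumption based on the descriptions above. The function wait_for_session and its parameters are hypothetical. The caller supplies fetch_session, a function that performs the GET request from this step and returns the decoded JSON.

```python
import time
from typing import Callable

# Assumption: these three states are final and will not change again.
TERMINAL_STATES = {"finished", "cancelled", "error"}


def wait_for_session(
    fetch_session: Callable[[], dict],
    interval: float = 1.0,
    max_attempts: int = 60,
    sleep: Callable[[float], None] = time.sleep,
) -> dict:
    """Poll until the session reaches a terminal state or attempts run out."""
    for attempt in range(max_attempts):
        session = fetch_session()
        if session["status"] in TERMINAL_STATES:
            return session
        if attempt < max_attempts - 1:
            sleep(interval)  # wait before asking again
    raise TimeoutError(f"session still not final after {max_attempts} attempts")
```

Passing the fetch function and the sleep function as arguments keeps the loop independent of any HTTP library and makes it easy to exercise with canned responses.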

6. Get resulting files

If everything went correctly, the session status will be finished, and a series of resulting files will have been generated. These files can be downloaded directly by the user, although it is also possible to use their URNs as identifiers in other Veridas service processes.

curl --location 'https://<VSS_HOST>/api/v1/files/urn:vss:files:2a604420-d08b-416e-8c24-9cb217b342fc:audio_0:net_speech' \
--header 'apikey: *****'

Notice how the URN of a file follows the format:

urn:vss:files:<stream_id>:<processed_stream_name>:<file_name>

This format allows each file to be uniquely identified and referenced across different Veridas services.
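Given that format, a client can split a file URN back into its components. The following is a minimal sketch that assumes none of the three components contains a colon. The names FileUrn and parse_file_urn are hypothetical.

```python
from typing import NamedTuple

FILE_URN_PREFIX = "urn:vss:files:"


class FileUrn(NamedTuple):
    stream_id: str
    processed_stream_name: str
    file_name: str


def parse_file_urn(urn: str) -> FileUrn:
    """Split urn:vss:files:<stream_id>:<processed_stream_name>:<file_name>."""
    if not urn.startswith(FILE_URN_PREFIX):
        raise ValueError(f"not a VSS file URN: {urn!r}")
    parts = urn[len(FILE_URN_PREFIX):].split(":")
    # Assumption: exactly three non-empty, colon-free components follow the prefix.
    if len(parts) != 3 or not all(parts):
        raise ValueError(f"unexpected number of components in {urn!r}")
    return FileUrn(*parts)


parsed = parse_file_urn(
    "urn:vss:files:2a604420-d08b-416e-8c24-9cb217b342fc:audio_0:net_speech"
)
print(parsed.stream_id, parsed.processed_stream_name, parsed.file_name)
```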