WEB Sensorizer


WEB Sensorizer: An Architecture for Regenerating Cyber Physical Data Streams from the Web

The World Wide Web contains a huge amount of periodically updated values originally sensed from the physical world, such as air pollutant density, road traffic conditions, and car park occupancy. In many cases, however, these data are not easily accessible from a computer program because no API is provided to fetch them. In this paper, to cope with this problem, we propose an architecture for discovering, excavating, and streaming such entombed web contents (EWC). This architecture, called Sensorizer, leverages crowdsourcing for accurate EWC discovery, periodic web scraping with a headless browser for excavation from dynamic web pages, and a standardized communication protocol (XMPP) for streaming the data to a wide variety of applications.

What is WEB Sensorizer?


WEB Sensorizer is a novel tool for acquiring real-world data in a simple yet extensible way. Its key idea is to place “virtual” sensors on web pages that contain meaningful values sensed from the real world. The figure on the right shows such web pages, which contain air quality information sensed in the corresponding cities. The numbers shown on these pages are updated periodically, and the past numbers become inaccessible because they are stored deep in a database. Virtual sensors placed on these pages periodically scrape and transmit the numbers. Since there are many web pages that contain real-world data, and deploying a virtual sensor takes only a few steps of GUI manipulation, virtual sensing can generate a huge amount of data that helps us understand the real world. From a system perspective, our virtual sensing technique consists of the following components.
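As a minimal sketch, a virtual sensor can be represented by the URL of the page it is placed on, the XPath of the target element, and a scraping interval. The class and field names below are our assumptions for illustration, not the actual SOXFire schema:

```java
// Hypothetical representation of one virtual sensor placed on a web page.
// Field names are illustrative assumptions, not the real SOXFire definition.
public class VirtualSensor {
    final String pageUrl;      // URL of the web page carrying the sensed value
    final String valueXPath;   // XPath of the element holding the value
    final long periodSeconds;  // how often Probe should scrape the page

    public VirtualSensor(String pageUrl, String valueXPath, long periodSeconds) {
        this.pageUrl = pageUrl;
        this.valueXPath = valueXPath;
        this.periodSeconds = periodSeconds;
    }
}
```

A definition like this is what the authoring tool produces with a few GUI steps, and what the server-side components consume.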

Authoring Tool

This client-side tool is an extension module of the Chrome browser that enables browser users to deploy virtual sensors on almost arbitrary elements on a web page.


Probe

Probe is the server-side program that takes a virtual sensor definition as input, which includes the URL of a web page and the XPath of the target elements, and periodically scrapes the element values from the page. It can also explore web pages with a similar structure and sensorize them automatically. Probe uses the Java version of the SOXFire API.
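The core of Probe's loop can be sketched with the standard Java XPath API: given a page and an XPath, extract the element value, and repeat at a fixed interval. This sketch assumes a well-formed (XHTML) page; real pages are often not valid XML, and dynamic pages must first be rendered by a headless browser, both of which the actual Probe handles:

```java
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathFactory;
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import org.w3c.dom.Document;

public class ProbeSketch {
    // Extract one element value from a well-formed page by XPath.
    static String scrapeValue(String pageXml, String xpath) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new ByteArrayInputStream(pageXml.getBytes(StandardCharsets.UTF_8)));
            return XPathFactory.newInstance().newXPath().evaluate(xpath, doc);
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }

    // Run a scrape task at a fixed interval, as Probe does per virtual sensor.
    static void schedulePeriodicScrape(Runnable scrape, long periodSeconds) {
        ScheduledExecutorService exec = Executors.newSingleThreadScheduledExecutor();
        exec.scheduleAtFixedRate(scrape, 0, periodSeconds, TimeUnit.SECONDS);
    }
}
```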

The data scraped from web pages are published to SOXFire and then transmitted to subscribers via the XMPP protocol. So far we have sensorized more than 400 thousand web pages and are generating more than 20 GB/day of sensor data streams via SOXFire.
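For illustration, the publish step can be sketched as building an XMPP publish-subscribe stanza (XEP-0060). The node name and payload element below are our assumptions for the sketch, not the actual SOXFire message format:

```java
public class SoxPublishSketch {
    // Build an XMPP pub-sub publish stanza carrying one scraped value.
    // The pubsub namespace is standard (XEP-0060); the <data> payload
    // shape is a hypothetical placeholder for the SOXFire format.
    static String publishStanza(String node, String sensorId,
                                String value, String timestamp) {
        return "<iq type='set'>"
             + "<pubsub xmlns='http://jabber.org/protocol/pubsub'>"
             + "<publish node='" + node + "'>"
             + "<item><data sensor='" + sensorId + "' time='" + timestamp + "'>"
             + value
             + "</data></item>"
             + "</publish></pubsub></iq>";
    }
}
```

Subscribers that have subscribed to the corresponding pub-sub node receive each published item as soon as the value is scraped.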

How can I use it?

Please see the manual. The following video is also helpful for understanding WEB Sensorizer.