
Note: All the posts are based on practical approach avoiding lengthy theory. All have been tested on some development servers. Please don’t test any post on production servers until you are sure.

Monday, January 08, 2018

Processing Twitter (JSON) Data in Oracle (12c External Table)


Problem:


  • We have a live Twitter stream ingested by Flume into our Hadoop cluster.
  • Flume is generating too many files in HDFS: roughly 2 files per second, i.e. about 172k files a day.
  • We have to process the Flume-generated Twitter JSON files.
  • We created an Oracle external table over the Twitter JSON files, but performance is very poor because of the large number of files.
  • We need a remedy for the above issues.


Pre-requisites:



Process JSON Data


JSON is a popular textual data format used for exchanging data in modern web and mobile applications. JSON is also used for storing unstructured data in log files or NoSQL databases. 


Oracle's JSON_TABLE function creates a relational view of JSON data. It maps the result of a JSON data evaluation into relational rows and columns. You can query the result returned by the function as a virtual relational table using SQL. The main purpose of JSON_TABLE is to create a row of relational data for each object inside a JSON array and output JSON values from within that object as individual SQL column values.

The JSON_TABLE function incorporates all the functionality of JSON_VALUE, JSON_EXISTS and JSON_QUERY. The syntax is a little more complicated than using the individual JSON functions, but a single JSON_TABLE call is more efficient than combining multiple calls to the other individual functions in one query.

You can specify JSON_TABLE only in the FROM clause of a SELECT statement. The function first applies a JSON path expression, called a row path expression, to the supplied JSON data. The JSON value that matches the row path expression is called a row source in that it generates a row of relational data. The COLUMNS clause evaluates the row source, finds specific JSON values within the row source, and returns those JSON values as SQL values in individual columns of a row of relational data.
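
Before applying it to the tweet data, here is a minimal, self-contained sketch of JSON_TABLE; the sample document and column names below are made up for illustration and are not part of the tweet feed. The row path '$.users[*]' produces one relational row per element of the "users" array, and the COLUMNS clause projects values out of each element.

-- Standalone JSON_TABLE sketch over a literal JSON document (hypothetical data)
SELECT jt.screen_name, jt.followers
FROM   dual,
       JSON_TABLE('{"users":[{"screen_name":"user_a","followers":10},
                             {"screen_name":"user_b","followers":25}]}',
                  '$.users[*]'
                  COLUMNS (
                    screen_name VARCHAR2(30 CHAR) PATH '$.screen_name',
                    followers   NUMBER            PATH '$.followers'
                  )) jt;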


1- First, check the files at the HDFS location (mounted via NFS). You will observe many small files created by Flume.

--get the files in order and observe the file name pattern
[root@te1-hdp-rp-en01 bin]# ls -al /data/hdfsloc/flume/twitter | more

2- Merge the small files into one large file. Later we will create the external table over this large file, which performs much better than an external table over too many small files.

[root@te1-hdp-rp-en01 twitter]# cat /data/hdfsloc/flume/twitter/FlumeData.1514* >> AllTweetsP1.json

I used the above command after observing the file name pattern; this way I was able to track which files I had already merged into the large file and which ones were still left.

You can check the .tmp file generated by Flume to see the current file name pattern, and then merge into the large file while keeping track of what has been processed.

[root@te1-hdp-rp-en01 twitter]# ls /data/hdfsloc/flume/twitter/*.tmp

FlumeData.1515318015287.tmp


3- After merging the small files into the large one, you can delete the small files as below.

rm -f /data/hdfsloc/flume/twitter/FlumeData.1514*
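
If you prefer to automate steps 2 and 3 together, a small shell loop along the following lines could do the merge and clean-up in one pass. This is only a sketch under the assumption that the paths and batch file name match the ones above; adjust the target file name per batch.

# Sketch only: merge every closed Flume file into the current batch file and
# delete each small file only after its merge succeeds. The open *.tmp file
# that Flume is still writing is skipped.
cd /data/hdfsloc/flume/twitter
for f in FlumeData.*; do
  [ -f "$f" ] || continue
  case "$f" in *.tmp) continue ;; esac
  cat "$f" >> AllTweetsP1.json && rm -f "$f"
done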

4- Now create the external table on the large file. As a single tweet can exceed 4000 characters, create the column with the CLOB data type. I'm going to create the table with only one column, which will hold the raw tweet exactly as received by the Flume agent.

CREATE OR REPLACE DIRECTORY nfs_tweet_dir AS '/data/hdfsloc/flume/twitter/';

CREATE OR REPLACE DIRECTORY admin_log_dir AS '/data/log/';

drop table xt_tweets_nfs;

CREATE TABLE xt_tweets_nfs (
  JSON_LINE CLOB
)
ORGANIZATION EXTERNAL
(
  TYPE ORACLE_LOADER
  DEFAULT DIRECTORY NFS_TWEET_DIR
  ACCESS PARAMETERS
  (
    RECORDS DELIMITED BY NEWLINE
    BADFILE ADMIN_LOG_DIR:'xt_tweets_nfs%a_%p.bad'
    LOGFILE ADMIN_LOG_DIR:'xt_tweets_nfs%a_%p.log'
    DISCARDFILE ADMIN_LOG_DIR:'xt_tweets_nfs%a_%p.disc'
    DNFS_ENABLE
    DNFS_READBUFFERS 10
    FIELDS TERMINATED BY '<FT>'
    OPTIONALLY ENCLOSED BY '"'
    MISSING FIELD VALUES ARE NULL
    (
      json_line CHAR(50000)
    )
  )
  LOCATION (NFS_TWEET_DIR:'AllTweetsP1.json')
)
PARALLEL 1
REJECT LIMIT 10;

SQL> alter table xt_tweets_nfs parallel (degree 8);


Table altered.

5- Check one line from your large JSON file to see the tweet data and its format, and then run a count query on the external table to see how many tweets you have in your large file.


SQL> host head -1 /data/hdfsloc/flume/twitter/AllTweetsP1.json



SQL> select count(*) from xt_tweets_nfs;

  COUNT(*)
----------
    388349


Query the tweet data using the external table.

SQL> set long 30000
SQL> set longchunksize 30000
SQL> set pages 100
SQL> select json_line from xt_tweets_nfs where rownum < 2;

JSON_LINE
--------------------------------------------------------------------------------
{"extended_tweet":{"entities":{"urls":[],"hashtags":[],"user_mentions":[{"indice
s":[0,13],"screen_name":"Citizen_care","id_str":"827055815026737152","name":"خدم
ة العملاء","id":827055815026737152},{"indices":[173,182],"screen_name":"AliRajhi
","id_str":"876924188963024897","name":"علي بن هادي راجحي","id":8769241889630248
97}],"symbols":[]},"full_text":"@Citizen_care لم أستفسر عن مقدار الإستحقاق؟ \nإس
تفساري ما هو وضعي بالضبط؟ \nهل أنا مستحق أو لا؟ \n و في حال إستحقاقي لماذا لم يص
رف لي؟ \nو في حال عدم إستحقاقي لماذا غير مستحق؟ \n@AliRajhi","display_text_range
":[14,182]},"in_reply_to_status_id_str":"945629575689752576","in_reply_to_status
_id":945629575689752576,"created_at":"Tue Dec 26 14:53:44 +0000 2017","in_reply_
to_user_id_str":"827055815026737152","source":"<a href=\"http://twitter.com/down
load/iphone\" rel=\"nofollow\">Twitter for iPhone<\/a>","retweet_count":0,"retwe
eted":false,"geo":null,"filter_level":"low","in_reply_to_screen_name":"Citizen_c
are","is_quote_status":false,"id_str":"945668962481770497","in_reply_to_user_id"
:827055815026737152,"favorite_count":0,"id":945668962481770497,"text":"@Citizen_
care لم أستفسر عن مقدار الإستحقاق؟ \nإستفساري ما هو وضعي بالضبط؟ \nهل أنا مستحق
أو لا؟ \n و في حال إستحقاقي ل\u2026 https://t.co/QF9rY4gu0R","place":null,"lang"
:"ar","quote_count":0,"favorited":false,"coordinates":null,"truncated":true,"tim
estamp_ms":"1514300024992","reply_count":0,"entities":{"urls":[{"display_url":"t
witter.com/i/web/status/9\u2026","indices":[117,140],"expanded_url":"https://twi
tter.com/i/web/status/945668962481770497","url":"https://t.co/QF9rY4gu0R"}],"has
htags":[],"user_mentions":[{"indices":[0,13],"screen_name":"Citizen_care","id_st
r":"827055815026737152","name":"خدمة العملاء","id":827055815026737152}],"symbols
":[]},"display_text_range":[14,140],"contributors":null,"user":{"utc_offset":nul
l,"friends_count":436,"profile_image_url_https":"https://pbs.twimg.com/profile_i
mages/3176612987/76cbf3f8800aecf0a7f4a2d2fd8239e1_normal.jpeg","listed_count":0,
"profile_background_image_url":"http://abs.twimg.com/images/themes/theme1/bg.png
","default_profile_image":false,"favourites_count":601,"description":"سائح في ال
وجوه .. مقيم بالفضاء !!","created_at":"Fri Mar 23 19:56:30 +0000 2012","is_trans
lator":false,"profile_background_image_url_https":"https://abs.twimg.com/images/
themes/theme1/bg.png","protected":false,"screen_name":"yousef_aljasser","id_str"
:"534756504","profile_link_color":"1DA1F2","translator_type":"none","id":5347565
04,"geo_enabled":false,"profile_background_color":"C0DEED","lang":"en","profile_
sidebar_border_color":"C0DEED","profile_text_color":"333333","verified":false,"p
rofile_image_url":"http://pbs.twimg.com/profile_images/3176612987/76cbf3f8800aec
f0a7f4a2d2fd8239e1_normal.jpeg","time_zone":null,"url":null,"contributors_enable
d":false,"profile_background_tile":false,"statuses_count":1065,"follow_request_s
ent":null,"followers_count":250,"profile_use_background_image":true,"default_pro
file":true,"following":null,"name":"Yousef Al jasser","location":"أينما يوجد طري
ق ... ","profile_sidebar_fill_color":"DDEEF6","notifications":null}}


6- The result above is raw JSON; we want to see it in tabular format, so query the JSON values using the Oracle-supplied JSON_TABLE function.

SQL> SELECT jt.created_at FROM sys.xt_tweets_nfs,JSON_TABLE(json_line, '$' COLUMNS (created_at    VARCHAR2(50 CHAR) PATH '$.created_at')) jt where rownum < 2;

CREATED_AT
-------------------------------
Tue Dec 26 14:53:44 +0000 2017
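
As a possible next step (not part of the original post), the created_at string can be converted into a proper TIMESTAMP WITH TIME ZONE so you can do date arithmetic on it. The format mask below assumes Twitter's usual "Tue Dec 26 14:53:44 +0000 2017" layout.

-- Sketch: parse Twitter's created_at into TIMESTAMP WITH TIME ZONE
SELECT TO_TIMESTAMP_TZ(jt.created_at,
         'DY MON DD HH24:MI:SS TZHTZM YYYY',
         'NLS_DATE_LANGUAGE=ENGLISH') AS created_ts
FROM   sys.xt_tweets_nfs,
       JSON_TABLE(json_line, '$'
         COLUMNS (created_at VARCHAR2(50 CHAR) PATH '$.created_at')) jt
WHERE  rownum < 2;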

You have successfully parsed the JSON tweet for one column only. Now we will create a view for all the columns we are interested in.


create or replace view sys.vw_ora_tweets
as
SELECT SYS_GUID() sys_guid,
       DBMS_LOB.getlength(json_line) AS blob_length,
       substr(jt.created_at,1,20) created_at,
       jt.source,
       jt.screen_name,
       jt.name,
       jt.text,
       jt.lang,
       jt.id_str,
       jt.geo,
       jt.favorite_count,
       jt.place,
       jt.favorited,
       jt.coordinates,
       jt.truncated,
       jt.friends_count,
       jt.profile_image_url_https,
       jt.profile_background_image_url,
       jt.user_description,
       jt.user_created_at,
       jt.user_protected,
       jt.user_lang,
       jt.user_time_zone,
       jt.user_location
FROM sys.xt_tweets_nfs,
     JSON_TABLE(json_line, '$'
       COLUMNS (
         created_at                   VARCHAR2(50 CHAR)  PATH '$.created_at',
         source                       VARCHAR2(512 CHAR) PATH '$.source',
         screen_name                  VARCHAR2(50 CHAR)  PATH '$.user.screen_name',
         name                         VARCHAR2(50 CHAR)  PATH '$.user.name',
         text                         VARCHAR2(512 CHAR) PATH '$.text',
         lang                         VARCHAR2(10 CHAR)  PATH '$.lang',
         id_str                       VARCHAR2(50 CHAR)  PATH '$.id_str',
         geo                          VARCHAR2(50 CHAR)  PATH '$.geo',
         favorite_count               VARCHAR2(50 CHAR)  PATH '$.favorite_count',
         place                        VARCHAR2(50 CHAR)  PATH '$.place',
         favorited                    VARCHAR2(50 CHAR)  PATH '$.favorited',
         coordinates                  VARCHAR2(50 CHAR)  PATH '$.coordinates',
         truncated                    VARCHAR2(50 CHAR)  PATH '$.truncated',
         friends_count                VARCHAR2(50 CHAR)  PATH '$.user.friends_count',
         profile_image_url_https      VARCHAR2(50 CHAR)  PATH '$.user.profile_image_url_https',
         profile_background_image_url VARCHAR2(50 CHAR)  PATH '$.user.profile_background_image_url',
         user_description             VARCHAR2(50 CHAR)  PATH '$.user.description',
         user_created_at              VARCHAR2(50 CHAR)  PATH '$.user.created_at',
         user_protected               VARCHAR2(50 CHAR)  PATH '$.user.protected',
         user_lang                    VARCHAR2(50 CHAR)  PATH '$.user.lang',
         user_time_zone               VARCHAR2(50 CHAR)  PATH '$.user.time_zone',
         user_location                VARCHAR2(50 CHAR)  PATH '$.user.location'
       )
     ) jt
/

Now you can run all of your analytics on this view.

select * from sys.vw_ora_tweets
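
For example, a simple aggregation over the view gives the tweet volume per language; this is just a sketch using the lang column defined in the view above.

-- Tweets per language, most common first
SELECT lang, COUNT(*) AS tweet_count
FROM   sys.vw_ora_tweets
GROUP  BY lang
ORDER  BY tweet_count DESC;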




7- If your tweet data keeps growing and the external table's performance becomes an issue, you can create a partitioned external table. I created the one below and used a trick: a virtual partition column (VIRPART) separates the large tweet data files into their related partitions, so queries that filter on this column read only the specific data files through the external table.


drop table xtp_tweets_nfs;

CREATE TABLE xtp_tweets_nfs (
  JSON_LINE CLOB,
  VIRPART   NUMBER
)
ORGANIZATION EXTERNAL
(
  TYPE ORACLE_LOADER
  DEFAULT DIRECTORY NFS_TWEET_DIR
  ACCESS PARAMETERS
  (
    RECORDS DELIMITED BY NEWLINE
    BADFILE ADMIN_LOG_DIR:'xt_tweets_nfs%a_%p.bad'
    LOGFILE ADMIN_LOG_DIR:'xt_tweets_nfs%a_%p.log'
    DISCARDFILE ADMIN_LOG_DIR:'xt_tweets_nfs%a_%p.disc'
    DNFS_ENABLE
    DNFS_READBUFFERS 10
    FIELDS TERMINATED BY '<FT>'
    OPTIONALLY ENCLOSED BY '"'
    MISSING FIELD VALUES ARE NULL
    (
      json_line CHAR(50000),
      "VIRPART" CHAR
    )
  )
)
PARALLEL 1
REJECT LIMIT 10
PARTITION BY LIST (VIRPART) (
  PARTITION PARTITION1 VALUES (1) LOCATION ('AllTweetsP1.json'),
  PARTITION PARTITION2 VALUES (2) LOCATION ('AllTweetsP2.json')
);

SQL> alter table xtp_tweets_nfs parallel (degree 8);

Table altered.

--Test Partitions

[root@te1-hdp-rp-en01 twitter]# cat /data/hdfsloc/flume/twitter/FlumeData.1515394213478 > AllTweetsP2.json

[root@te1-hdp-rp-en01 twitter]# ls All*
AllTweetsP1.json  AllTweetsP2.json

SQL> select count(*) from xtp_tweets_nfs where virpart=1;

  COUNT(*)
----------
    388349

Elapsed: 00:00:54.19

SQL> select count(*) from xtp_tweets_nfs where virpart=2;

  COUNT(*)
----------
         2

Elapsed: 00:00:00.10
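
When the next merged batch file is ready, you should be able to register it as a new partition without recreating the table. The statement below is only a sketch: AllTweetsP3.json is a hypothetical file name, and it assumes the ALTER TABLE ... ADD PARTITION syntax available for 12.2 partitioned external tables.

-- Sketch: add a partition pointing at a new batch file (hypothetical name)
ALTER TABLE xtp_tweets_nfs
  ADD PARTITION partition3 VALUES (3)
  LOCATION ('AllTweetsP3.json');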


Using Materialized Views to Improve Performance


A materialized view is a database object that contains the results of a query. The FROM clause of the query can name tables, views, and other materialized views. Collectively these objects are called master tables (a replication term) or detail tables (a data warehousing term). This reference uses "master tables" for consistency. The databases containing the master tables are called the master databases. 

Materialized views are a really useful performance feature, allowing you to pre-calculate joins and aggregations, which can make applications and reports feel more responsive.
Suppose you have an application that requires the total number of tweets by each user. You could create a materialized view as below for fast response.

-- Sanity-check the refresh interval: SYSDATE + 1/1440 is one minute from now
select to_char(SYSDATE ,'DD-MON-YYYY hh:mi:ss'),to_char((SYSDATE + 1/1440 ),'DD-MON-YYYY hh:mi:ss') from dual;

drop MATERIALIZED VIEW mv_complete_xt_tweets_nfs;

CREATE  MATERIALIZED VIEW mv_complete_xt_tweets_nfs
TABLESPACE users
REFRESH COMPLETE
START WITH SYSDATE
NEXT SYSDATE + 1/1440 
AS
select   screen_name,count(*) tot_tweets_by_user
from sys.vw_ora_tweets group by screen_name;


Materialized view created.

Elapsed: 00:00:19.12

SQL> desc mv_complete_xt_tweets_nfs

 Name                                                              Null?    Type
 ----------------------------------------------------------------- -------- -------
 SCREEN_NAME                                                                VARCHAR2(200)
 TOT_TWEETS_BY_USER                                                         NUMBER


--Query materialized view

select  substr(screen_name,1,20) screen_name, tot_tweets_by_user 
from mv_complete_xt_tweets_nfs
where tot_tweets_by_user > 500
order by tot_tweets_by_user desc;


SCREEN_NAME                                                  TOT_TWEETS_BY_USER
------------------------------------------------------------ ------------------
misrstars                                                                  758
Citizen_care                                                               542

Elapsed: 00:00:00.01


SQL> select sum(tot_tweets_by_user) from mv_complete_xt_tweets_nfs;

SUM(TOT_TWEETS_BY_USER)
-----------------------
                 402819

Elapsed: 00:00:00.02
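
To confirm that the one-minute automatic refresh is actually running, you can check the data dictionary; this is a sketch, with the materialized view name being the one created above.

-- Check when the materialized view was last refreshed
SELECT mview_name, refresh_mode, last_refresh_date
FROM   user_mviews
WHERE  mview_name = 'MV_COMPLETE_XT_TWEETS_NFS';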
