
Note: All the posts are based on practical approach avoiding lengthy theory. All have been tested on some development servers. Please don’t test any post on production servers until you are sure.

Monday, January 08, 2018

Processing Twitter (JSON) Data in Oracle (12c External Table)


Problem:


  • We have a live Twitter stream ingested by Flume into our Hadoop cluster.
  • Flume is generating too many files in HDFS: roughly 2 files per second, i.e. about 172k files a day.
  • We have to process the Flume-generated Twitter JSON files.
  • We created an Oracle external table over the Twitter JSON files, but performance is very poor because of the large number of files.
  • We need a remedy for the above issues.


Pre-requisites:



Process JSON Data


JSON is a popular textual data format used for exchanging data in modern web and mobile applications. JSON is also used for storing unstructured data in log files or NoSQL databases. 


Oracle's JSON_TABLE function creates a relational view of JSON data. It maps the result of a JSON data evaluation into relational rows and columns. You can query the result returned by the function as a virtual relational table using SQL. The main purpose of JSON_TABLE is to create a row of relational data for each object inside a JSON array and output JSON values from within that object as individual SQL column values.

The JSON_TABLE function incorporates all the functionality of JSON_VALUE, JSON_EXISTS and JSON_QUERY. The syntax is a little more complicated than using the individual JSON functions, but a single JSON_TABLE call is more efficient than combining multiple calls to the other individual functions in one query.

You can specify JSON_TABLE only in the FROM clause of a SELECT statement. The function first applies a JSON path expression, called a row path expression, to the supplied JSON data. The JSON value that matches the row path expression is called a row source in that it generates a row of relational data. The COLUMNS clause evaluates the row source, finds specific JSON values within the row source, and returns those JSON values as SQL values in individual columns of a row of relational data.
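
Before applying it to the tweet data, here is a minimal, self-contained sketch of JSON_TABLE; the sample document and column names below are made up for illustration and are not part of the tweet feed. The row path '$.users[*]' produces one relational row per element of the "users" array, and the COLUMNS clause projects values out of each element.

-- Standalone JSON_TABLE sketch over a literal JSON document (hypothetical data)
SELECT jt.screen_name, jt.followers
FROM   dual,
       JSON_TABLE('{"users":[{"screen_name":"user_a","followers":10},
                             {"screen_name":"user_b","followers":25}]}',
                  '$.users[*]'
                  COLUMNS (
                    screen_name VARCHAR2(30 CHAR) PATH '$.screen_name',
                    followers   NUMBER            PATH '$.followers'
                  )) jt;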


1- First, check the files at the HDFS location (mounted via NFS). You will observe many small files created by Flume.

--get the files in order and observe the file name pattern
[root@te1-hdp-rp-en01 bin]# ls -al /data/hdfsloc/flume/twitter | more

2- Merge the small files into one large file. Later we will create the external table over this large file, which performs much better than an external table over too many small files.

[root@te1-hdp-rp-en01 twitter]# cat /data/hdfsloc/flume/twitter/FlumeData.1514* >> AllTweetsP1.json

I used the above command after observing the file name pattern; this way I was able to track which files I had already merged into the large file and which ones were still left.

You can check the .tmp file generated by Flume to see the current file name pattern, and then merge into the large file while keeping track of what has been processed.

[root@te1-hdp-rp-en01 twitter]# ls /data/hdfsloc/flume/twitter/*.tmp

FlumeData.1515318015287.tmp


3- After merging the small files into the large one, you can delete the small files as below.

rm -f /data/hdfsloc/flume/twitter/FlumeData.1514*
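
If you prefer to automate steps 2 and 3 together, a small shell loop along the following lines could do the merge and clean-up in one pass. This is only a sketch under the assumption that the paths and batch file name match the ones above; adjust the target file name per batch.

# Sketch only: merge every closed Flume file into the current batch file and
# delete each small file only after its merge succeeds. The open *.tmp file
# that Flume is still writing is skipped.
cd /data/hdfsloc/flume/twitter
for f in FlumeData.*; do
  [ -f "$f" ] || continue
  case "$f" in *.tmp) continue ;; esac
  cat "$f" >> AllTweetsP1.json && rm -f "$f"
done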

4- Now create the external table on the large file. As a single tweet can exceed 4000 characters, create the column with the CLOB data type. I'm going to create the table with only one column, which will hold the raw tweet exactly as received by the Flume agent.

CREATE OR REPLACE DIRECTORY nfs_tweet_dir AS '/data/hdfsloc/flume/twitter/';

CREATE OR REPLACE DIRECTORY admin_log_dir AS '/data/log/';

drop table xt_tweets_nfs;

CREATE TABLE xt_tweets_nfs (
  JSON_LINE CLOB
)
ORGANIZATION EXTERNAL
(
  TYPE ORACLE_LOADER
  DEFAULT DIRECTORY NFS_TWEET_DIR
  ACCESS PARAMETERS
  (
    RECORDS DELIMITED BY NEWLINE
    BADFILE ADMIN_LOG_DIR:'xt_tweets_nfs%a_%p.bad'
    LOGFILE ADMIN_LOG_DIR:'xt_tweets_nfs%a_%p.log'
    DISCARDFILE ADMIN_LOG_DIR:'xt_tweets_nfs%a_%p.disc'
    DNFS_ENABLE
    DNFS_READBUFFERS 10
    FIELDS TERMINATED BY '<FT>'
    OPTIONALLY ENCLOSED BY '"'
    MISSING FIELD VALUES ARE NULL
    (
      json_line CHAR(50000)
    )
  )
  LOCATION (NFS_TWEET_DIR:'AllTweetsP1.json')
)
PARALLEL 1
REJECT LIMIT 10;

SQL> alter table xt_tweets_nfs parallel (degree 8);


Table altered.

5- Check one line from your large JSON file to see the tweet data and its format, and then run a count query on the external table to see how many tweets you have in your large file.


SQL> host head -1 /data/hdfsloc/flume/twitter/AllTweetsP1.json



SQL> select count(*) from xt_tweets_nfs;

  COUNT(*)
----------
    388349


Query the tweet data using the external table.

SQL> set long 30000
SQL> set longchunksize 30000
SQL> set pages 100
SQL> select json_line from xt_tweets_nfs where rownum < 2;

JSON_LINE
--------------------------------------------------------------------------------
{"extended_tweet":{"entities":{"urls":[],"hashtags":[],"user_mentions":[{"indice
s":[0,13],"screen_name":"Citizen_care","id_str":"827055815026737152","name":"خدم
ة العملاء","id":827055815026737152},{"indices":[173,182],"screen_name":"AliRajhi
","id_str":"876924188963024897","name":"علي بن هادي راجحي","id":8769241889630248
97}],"symbols":[]},"full_text":"@Citizen_care لم أستفسر عن مقدار الإستحقاق؟ \nإس
تفساري ما هو وضعي بالضبط؟ \nهل أنا مستحق أو لا؟ \n و في حال إستحقاقي لماذا لم يص
رف لي؟ \nو في حال عدم إستحقاقي لماذا غير مستحق؟ \n@AliRajhi","display_text_range
":[14,182]},"in_reply_to_status_id_str":"945629575689752576","in_reply_to_status
_id":945629575689752576,"created_at":"Tue Dec 26 14:53:44 +0000 2017","in_reply_
to_user_id_str":"827055815026737152","source":"<a href=\"http://twitter.com/down
load/iphone\" rel=\"nofollow\">Twitter for iPhone<\/a>","retweet_count":0,"retwe
eted":false,"geo":null,"filter_level":"low","in_reply_to_screen_name":"Citizen_c
are","is_quote_status":false,"id_str":"945668962481770497","in_reply_to_user_id"
:827055815026737152,"favorite_count":0,"id":945668962481770497,"text":"@Citizen_
care لم أستفسر عن مقدار الإستحقاق؟ \nإستفساري ما هو وضعي بالضبط؟ \nهل أنا مستحق
أو لا؟ \n و في حال إستحقاقي ل\u2026 https://t.co/QF9rY4gu0R","place":null,"lang"
:"ar","quote_count":0,"favorited":false,"coordinates":null,"truncated":true,"tim
estamp_ms":"1514300024992","reply_count":0,"entities":{"urls":[{"display_url":"t
witter.com/i/web/status/9\u2026","indices":[117,140],"expanded_url":"https://twi
tter.com/i/web/status/945668962481770497","url":"https://t.co/QF9rY4gu0R"}],"has
htags":[],"user_mentions":[{"indices":[0,13],"screen_name":"Citizen_care","id_st
r":"827055815026737152","name":"خدمة العملاء","id":827055815026737152}],"symbols
":[]},"display_text_range":[14,140],"contributors":null,"user":{"utc_offset":nul
l,"friends_count":436,"profile_image_url_https":"https://pbs.twimg.com/profile_i
mages/3176612987/76cbf3f8800aecf0a7f4a2d2fd8239e1_normal.jpeg","listed_count":0,
"profile_background_image_url":"http://abs.twimg.com/images/themes/theme1/bg.png
","default_profile_image":false,"favourites_count":601,"description":"سائح في ال
وجوه .. مقيم بالفضاء !!","created_at":"Fri Mar 23 19:56:30 +0000 2012","is_trans
lator":false,"profile_background_image_url_https":"https://abs.twimg.com/images/
themes/theme1/bg.png","protected":false,"screen_name":"yousef_aljasser","id_str"
:"534756504","profile_link_color":"1DA1F2","translator_type":"none","id":5347565
04,"geo_enabled":false,"profile_background_color":"C0DEED","lang":"en","profile_
sidebar_border_color":"C0DEED","profile_text_color":"333333","verified":false,"p
rofile_image_url":"http://pbs.twimg.com/profile_images/3176612987/76cbf3f8800aec
f0a7f4a2d2fd8239e1_normal.jpeg","time_zone":null,"url":null,"contributors_enable
d":false,"profile_background_tile":false,"statuses_count":1065,"follow_request_s
ent":null,"followers_count":250,"profile_use_background_image":true,"default_pro
file":true,"following":null,"name":"Yousef Al jasser","location":"أينما يوجد طري
ق ... ","profile_sidebar_fill_color":"DDEEF6","notifications":null}}


6- The result above is raw JSON; we want to see it in tabular format, so query the JSON values using the Oracle-supplied JSON_TABLE function.

SQL> SELECT jt.created_at FROM sys.xt_tweets_nfs,JSON_TABLE(json_line, '$' COLUMNS (created_at    VARCHAR2(50 CHAR) PATH '$.created_at')) jt where rownum < 2;

CREATED_AT
-------------------------------
Tue Dec 26 14:53:44 +0000 2017
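
As a possible next step (not part of the original post), the created_at string can be converted into a proper TIMESTAMP WITH TIME ZONE so you can do date arithmetic on it. The format mask below assumes Twitter's usual "Tue Dec 26 14:53:44 +0000 2017" layout.

-- Sketch: parse Twitter's created_at into TIMESTAMP WITH TIME ZONE
SELECT TO_TIMESTAMP_TZ(jt.created_at,
         'DY MON DD HH24:MI:SS TZHTZM YYYY',
         'NLS_DATE_LANGUAGE=ENGLISH') AS created_ts
FROM   sys.xt_tweets_nfs,
       JSON_TABLE(json_line, '$'
         COLUMNS (created_at VARCHAR2(50 CHAR) PATH '$.created_at')) jt
WHERE  rownum < 2;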

You have successfully parsed the JSON tweet for one column only. Now we will create a view for all the columns we are interested in.


create or replace view sys.vw_ora_tweets
as
SELECT SYS_GUID() sys_guid,
       DBMS_LOB.getlength(json_line) AS blob_length,
       substr(jt.created_at,1,20) created_at,
       jt.source,
       jt.screen_name,
       jt.name,
       jt.text,
       jt.lang,
       jt.id_str,
       jt.geo,
       jt.favorite_count,
       jt.place,
       jt.favorited,
       jt.coordinates,
       jt.truncated,
       jt.friends_count,
       jt.profile_image_url_https,
       jt.profile_background_image_url,
       jt.user_description,
       jt.user_created_at,
       jt.user_protected,
       jt.user_lang,
       jt.user_time_zone,
       jt.user_location
FROM sys.xt_tweets_nfs,
     JSON_TABLE(json_line, '$'
       COLUMNS (
         created_at                   VARCHAR2(50 CHAR)  PATH '$.created_at',
         source                       VARCHAR2(512 CHAR) PATH '$.source',
         screen_name                  VARCHAR2(50 CHAR)  PATH '$.user.screen_name',
         name                         VARCHAR2(50 CHAR)  PATH '$.user.name',
         text                         VARCHAR2(512 CHAR) PATH '$.text',
         lang                         VARCHAR2(10 CHAR)  PATH '$.lang',
         id_str                       VARCHAR2(50 CHAR)  PATH '$.id_str',
         geo                          VARCHAR2(50 CHAR)  PATH '$.geo',
         favorite_count               VARCHAR2(50 CHAR)  PATH '$.favorite_count',
         place                        VARCHAR2(50 CHAR)  PATH '$.place',
         favorited                    VARCHAR2(50 CHAR)  PATH '$.favorited',
         coordinates                  VARCHAR2(50 CHAR)  PATH '$.coordinates',
         truncated                    VARCHAR2(50 CHAR)  PATH '$.truncated',
         friends_count                VARCHAR2(50 CHAR)  PATH '$.user.friends_count',
         profile_image_url_https      VARCHAR2(50 CHAR)  PATH '$.user.profile_image_url_https',
         profile_background_image_url VARCHAR2(50 CHAR)  PATH '$.user.profile_background_image_url',
         user_description             VARCHAR2(50 CHAR)  PATH '$.user.description',
         user_created_at              VARCHAR2(50 CHAR)  PATH '$.user.created_at',
         user_protected               VARCHAR2(50 CHAR)  PATH '$.user.protected',
         user_lang                    VARCHAR2(50 CHAR)  PATH '$.user.lang',
         user_time_zone               VARCHAR2(50 CHAR)  PATH '$.user.time_zone',
         user_location                VARCHAR2(50 CHAR)  PATH '$.user.location'
       )
     ) jt
/

Now you can run all of your analytics on this view.

select * from sys.vw_ora_tweets
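
For example, a simple aggregation over the view gives the tweet volume per language; this is just a sketch using the lang column defined in the view above.

-- Tweets per language, most common first
SELECT lang, COUNT(*) AS tweet_count
FROM   sys.vw_ora_tweets
GROUP  BY lang
ORDER  BY tweet_count DESC;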




7- If your tweet data keeps growing and the external table's performance becomes an issue, you can create a partitioned external table. I created the one below and used a trick: a virtual partition column (VIRPART) separates the large tweet data files into their related partitions, so queries that filter on this column read only the specific data files through the external table.


drop table xtp_tweets_nfs;

CREATE TABLE xtp_tweets_nfs (
  JSON_LINE CLOB,
  VIRPART   NUMBER
)
ORGANIZATION EXTERNAL
(
  TYPE ORACLE_LOADER
  DEFAULT DIRECTORY NFS_TWEET_DIR
  ACCESS PARAMETERS
  (
    RECORDS DELIMITED BY NEWLINE
    BADFILE ADMIN_LOG_DIR:'xt_tweets_nfs%a_%p.bad'
    LOGFILE ADMIN_LOG_DIR:'xt_tweets_nfs%a_%p.log'
    DISCARDFILE ADMIN_LOG_DIR:'xt_tweets_nfs%a_%p.disc'
    DNFS_ENABLE
    DNFS_READBUFFERS 10
    FIELDS TERMINATED BY '<FT>'
    OPTIONALLY ENCLOSED BY '"'
    MISSING FIELD VALUES ARE NULL
    (
      json_line CHAR(50000),
      "VIRPART" CHAR
    )
  )
)
PARALLEL 1
REJECT LIMIT 10
PARTITION BY LIST (VIRPART) (
  PARTITION PARTITION1 VALUES (1) LOCATION ('AllTweetsP1.json'),
  PARTITION PARTITION2 VALUES (2) LOCATION ('AllTweetsP2.json')
);

SQL> alter table xtp_tweets_nfs parallel (degree 8);

Table altered.

--Test Partitions

[root@te1-hdp-rp-en01 twitter]# cat /data/hdfsloc/flume/twitter/FlumeData.1515394213478 > AllTweetsP2.json

[root@te1-hdp-rp-en01 twitter]# ls All*
AllTweetsP1.json  AllTweetsP2.json

SQL> select count(*) from xtp_tweets_nfs where virpart=1;

  COUNT(*)
----------
    388349

Elapsed: 00:00:54.19

SQL> select count(*) from xtp_tweets_nfs where virpart=2;

  COUNT(*)
----------
         2

Elapsed: 00:00:00.10
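
When the next merged batch file is ready, you should be able to register it as a new partition without recreating the table. The statement below is only a sketch: AllTweetsP3.json is a hypothetical file name, and it assumes the ALTER TABLE ... ADD PARTITION syntax available for 12.2 partitioned external tables.

-- Sketch: add a partition pointing at a new batch file (hypothetical name)
ALTER TABLE xtp_tweets_nfs
  ADD PARTITION partition3 VALUES (3)
  LOCATION ('AllTweetsP3.json');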


Using Materialized Views to Improve Performance


A materialized view is a database object that contains the results of a query. The FROM clause of the query can name tables, views, and other materialized views. Collectively these objects are called master tables (a replication term) or detail tables (a data warehousing term). This reference uses "master tables" for consistency. The databases containing the master tables are called the master databases. 

Materialized views are a really useful performance feature, allowing you to pre-calculate joins and aggregations, which can make applications and reports feel more responsive.
Suppose you have an application that requires the total number of tweets by each user. You could create a materialized view as below for fast response.

-- Sanity-check the refresh interval: SYSDATE + 1/1440 is one minute from now
select to_char(SYSDATE ,'DD-MON-YYYY hh:mi:ss'),to_char((SYSDATE + 1/1440 ),'DD-MON-YYYY hh:mi:ss') from dual;

drop MATERIALIZED VIEW mv_complete_xt_tweets_nfs;

CREATE  MATERIALIZED VIEW mv_complete_xt_tweets_nfs
TABLESPACE users
REFRESH COMPLETE
START WITH SYSDATE
NEXT SYSDATE + 1/1440 
AS
select   screen_name,count(*) tot_tweets_by_user
from sys.vw_ora_tweets group by screen_name;


Materialized view created.

Elapsed: 00:00:19.12

SQL> desc mv_complete_xt_tweets_nfs

 Name                                                              Null?    Type
 ----------------------------------------------------------------- -------- -------
 SCREEN_NAME                                                                VARCHAR2(200)
 TOT_TWEETS_BY_USER                                                         NUMBER


--Query materialized view

select  substr(screen_name,1,20) screen_name, tot_tweets_by_user 
from mv_complete_xt_tweets_nfs
where tot_tweets_by_user > 500
order by tot_tweets_by_user desc;


SCREEN_NAME                                                  TOT_TWEETS_BY_USER
------------------------------------------------------------ ------------------
misrstars                                                                  758
Citizen_care                                                               542

Elapsed: 00:00:00.01


SQL> select sum(tot_tweets_by_user) from mv_complete_xt_tweets_nfs;

SUM(TOT_TWEETS_BY_USER)
-----------------------
                 402819

Elapsed: 00:00:00.02
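
To confirm that the one-minute automatic refresh is actually running, you can check the data dictionary; this is a sketch, with the materialized view name being the one created above.

-- Check when the materialized view was last refreshed
SELECT mview_name, refresh_mode, last_refresh_date
FROM   user_mviews
WHERE  mview_name = 'MV_COMPLETE_XT_TWEETS_NFS';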
