1. Is there a maximum limit on the key size of an HBase table?
2. Multiple small tables versus one large table: which is preferred?
3. For bulk load: when LoadIncrementalHFiles runs, it recalculates the region splits based on region boundaries. Does this division happen on the client side, or on the server side at the region server or the HBase master? Does it then assign the pieces that cross a target region boundary to the appropriate region server?
1. Is hbase.client.keyvalue.maxsize the maximum size of the whole row or of the key only? Is there a limit on the key size alone?
2. The access pattern is mostly key-based. Are memstores and regions on a region server per table? If I have multiple tables, will there be more memstores than there would be with one large table?
On Mon, Aug 17, 2015 at 7:29 PM, Ted Yu <[EMAIL PROTECTED]> wrote:
1. So the size limit applies to each cell's identifier plus its value?
Which is more optimal: keeping a field in the key or in a column of a column family, given that every row has that field?
Say I have a field F1 in all rows.
Situation 1: key1#F1 as a composite key, with the remaining fields in columns.
Situation 2: key1 as the key, with F1 stored as a column in a column family.
This is the main reason I asked about the key size limit. If I ask for the number of rows where F1 = 'someval', will it be faster in situation 1 than in situation 2, since in situation 1 the result can be returned just by traversing keys, with no need to read columns?
On Mon, Aug 17, 2015 at 8:27 PM, Ted Yu <[EMAIL PROTECTED]> wrote:
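The two layouts in the question can be compared with a small in-memory sketch. This is not HBase code: plain Python dicts stand in for tables, and `rows_composite`, `rows_column`, `count_from_keys`, and `count_from_columns` are hypothetical names chosen for illustration. It only shows where F1 has to be read from in each situation and makes no claim about relative speed on a real cluster.

```python
# Situation 1: F1 is embedded in the row key as key#F1.
rows_composite = {
    "key1#red": {"cf:other": "a"},
    "key2#blue": {"cf:other": "b"},
    "key3#red": {"cf:other": "c"},
}

# Situation 2: F1 is stored as a column value next to the other fields.
rows_column = {
    "key1": {"cf:F1": "red", "cf:other": "a"},
    "key2": {"cf:F1": "blue", "cf:other": "b"},
    "key3": {"cf:F1": "red", "cf:other": "c"},
}


def count_from_keys(rows, wanted):
    """Count rows whose key ends in '#<wanted>'; only keys are inspected."""
    return sum(1 for key in rows if key.split("#", 1)[1] == wanted)


def count_from_columns(rows, wanted, column="cf:F1"):
    """Count rows whose F1 column equals wanted; the cell value is read."""
    return sum(1 for cells in rows.values() if cells.get(column) == wanted)
```

In this sketch F1 is the suffix of the key, so every key is still visited in both situations; the only difference is whether a column value has to be read as well.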
1. Say the requirement is to count the distinct values of F1.
If the field is part of the key, can't HBase just scan the keys, skip value deserialization, and return the result to the client, which will calculate the distinct count? In the second approach, HBase will deserialize the value of the column containing F1 and return it to the client, which will calculate the distinct count.
2. For bulk load, when LoadIncrementalHFiles runs and the region server moves the HFiles from HDFS to the region directory: does the region server localize each HFile by downloading it locally and then uploading it again into the region directory? Or does it just move the file to the region directory and wait for the next compaction to localize it, as in the region server failure case?
On Mon, Aug 17, 2015 at 11:00 PM, Ted Yu <[EMAIL PROTECTED]> wrote:
For an HBase key with time as a prefix, say (yyyy-mm-dd#other GUID-based fields), I am using bulk load to avoid hotspotting a region server (avoiding writes to the WAL).
What should the initial region splits be? Say I have 30 region servers.
Will it work to use the first 30 days as the initial splits and let auto-split take care of splitting regions as they grow further? Or, since the key has the date as a prefix, when a region is split in two at the midpoint and new data arrives only for increasing dates, will one region always stay half filled, with its other half never filled?
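The "first 30 days as initial splits" idea can be sketched as follows. This is a plain Python illustration, not an HBase API: `daily_split_points` and `region_index` are hypothetical helpers, the start date is an assumed example, and Python string comparison stands in for HBase's byte-wise ordering (equivalent for ASCII keys like these).

```python
import bisect
from datetime import date, timedelta


def daily_split_points(first_day, regions):
    """Return regions - 1 split keys, one per day boundary after first_day."""
    return [(first_day + timedelta(days=i)).isoformat() for i in range(1, regions)]


def region_index(row_key, split_points):
    """Index of the region that would hold row_key (start key inclusive)."""
    return bisect.bisect_right(split_points, row_key)


# 30 regions need 29 split points: 2015-08-02 through 2015-08-30.
splits = daily_split_points(date(2015, 8, 1), 30)

first = region_index("2015-08-01#guid", splits)   # first region
last = region_index("2015-08-30#guid", splits)    # last pre-split region
later = region_index("2015-09-15#guid", splits)   # any later date
```

Under these assumptions, every key dated after the last split point maps to the same final region, which is the concern raised in the question.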
On Tue, Aug 18, 2015 at 9:41 PM, anil gupta <[EMAIL PROTECTED]> wrote:
When the last region gets new data and is split in two, what is the split point? Say the last region had 10 files and the split algorithm decided to split this region.
Will the two child regions have 5 files each? Or will the key space of the original (parent) region, say the range (2015-08-01#guid to 2015-08-06#guid), be divided into 2 equal parts, so that child1 has the range (2015-08-01#guid to 2015-08-03#guid) and child2 has the range (2015-08-04#guid to 2015-08-06#guid), with all data rewritten into the child regions to accommodate these key ranges? Since the data is time series based, new data will arrive with increasing dates, and only for dates > 2015-08-06, so it will all go to child2 and child1 will always be half filled. Only child2 will lead to new splits when it reaches the split size threshold.
On Wed, Aug 19, 2015 at 4:16 AM, Ted Yu <[EMAIL PROTECTED]> wrote:
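The second scenario in the question can be simulated in a few lines. This sketch assumes the split point is simply the middle key of the parent's sorted keys; it does not model how HBase actually chooses a split point or handles the parent's files. `split_region` is a hypothetical helper, and one key per day is used for brevity.

```python
def split_region(sorted_keys):
    """Split at the middle key: child1 gets keys below it, child2 the rest."""
    mid = sorted_keys[len(sorted_keys) // 2]
    child1 = [k for k in sorted_keys if k < mid]
    child2 = [k for k in sorted_keys if k >= mid]
    return mid, child1, child2


# Parent region covering 2015-08-01 through 2015-08-06.
parent = [f"2015-08-{d:02d}#guid" for d in range(1, 7)]
mid, child1, child2 = split_region(parent)

# New time-series data only ever carries later dates.
new_keys = [f"2015-08-{d:02d}#guid" for d in range(7, 11)]
for key in new_keys:
    (child1 if key < mid else child2).append(key)
```

With these assumptions, child1 keeps its three original keys while child2 absorbs every new write, so only child2 ever grows toward the split threshold.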